#sql

    Your customer data, together at last
    April 06, 2016 | Andy Granowitz

    Customers interact with your company through many paths — they engage inside your product, open emails, view ads, answer surveys, etc. To understand your customers, you have to combine product engagement with their other interactions.

    Unfortunately, it’s difficult to analyze customer data across these different experiences. Customer behavior in your product is usually measured through a 3rd party tracking service or by analyzing raw application logs. Other interactions, such as customer support chats or sales calls, are stored in SaaS apps or homegrown internal tools. You can only access this data through each of their limited reporting UIs, exports, or possibly an API. There’s no magic wand for joining these disparate streams.

    There are two common approaches to join these multiple datasets: (1) send all data to one tracking service or (2) move the data to a common place. Sending all the data to a single tracking service, like Google Analytics or Mixpanel, is cumbersome, nearly impossible to implement, and inflexible to query. Dead end.

    The better approach is to move all customer data into a common store. Amazon Redshift, Google BigQuery, and Azure SQL Data Warehouse are our favorite cloud data stores for large scale analytics. They’re fast, easy to manage, and decreasing in price. But moving data from 3rd party sources to these cloud databases is still hard. You have to work with multiple APIs, find join keys, schedule data pulls, handle backfill, etc. It’s a huge headache.

    There are two strategies: build or buy. Building your own is complicated - we have a post about it!




    Today, thanks to Segment Sources, bringing all of your customer touch points together got easier. They’ve made it easy to pull customer data from many sources, such as Salesforce and Zendesk, as well as behavioral data from your app or website, into a single warehouse. We worked with Segment to write some starter queries - give them a go. If you’re already using Wagon and connect a Segment Warehouse, you can open these queries right in the app.

    Segment will be expanding its Source catalog in the coming months, and you can always check out the current catalog to see the latest additions. We’re excited to partner with Segment on our shared mission to make answering questions with data easy.

    We also recommend other tools to help unify your various customer data sets: Fivetran, RJ Metrics Pipeline, and Snowplow – so that you’ll always have access to the data you need in Wagon, no matter how you choose to set up your pipeline.

    Modern ETL + modern SQL tools = win!

    Python UDFs in Amazon Redshift
    February 01, 2016 | Elise Breda (Yhat)

    Elise Breda is the Growth Strategist at Yhat, a tech company that builds software for deploying predictive analytics to mobile and web apps. When she isn’t at their awesome new office in DUMBO, she can be found exploring the bike paths and Thai restaurants of NYC.

    In the Beginning…

    In the early 2000’s, the behemoth collection of cloud computing services we now know as Amazon Web Services (AWS) was little more than sparks firing in Chris Pinkham and Benjamin Black’s neurons. In 2003, the two presented a paper (blog post here) outlining a radical vision for a retail computing infrastructure that would be standardized, automated and built upon web services. The next year, Simple Queue Service, the first AWS service for public usage, was launched.

    Fast forward almost a decade, and AWS is now the most commonly used cloud platform among enterprise software developers. AWS products span the gamut of web services, from computation (e.g. EC2) to networking (e.g. VPC) and storage (e.g. S3). In this post we’ll explore a small fraction of a fraction of the AWS ecosystem: a database that’s generating all kinds of groundswell right now, Amazon Redshift.

    The rest of this post will talk about Redshift at a high level and then dive into a mini overview of User Defined Functions (UDFs), how they work, why they’re great, and how to start using them.

    Amazon Redshift

    Amazon Redshift is a hosted data warehouse that’s accessible, easy to set up, built for speed, and suitable for a variety of data-combining, storage, and compute-heavy analytics tasks.

    Two things make Redshift particularly attractive. First, Redshift can handle insane amounts of data: it is a petabyte-scale warehouse. A petabyte is a lot (10^15 bytes) of data. As a point of reference, the entire master catalog of Netflix video in 2013 amounted to about 3.14 petabytes of storage space (interesting read on Quora). Second, unlike Amazon’s other hosted database product, Amazon RDS, Redshift stores data in a column-oriented structure. Column orientation is good for tables containing columns with lots of repeated values (i.e. credit card names, county/state, product type, etc.), like CRM data. The benefit of columnar data is that because it’s uniform, there are opportunities for storage size optimization via compression. You can read more about how to maximize compression here.
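    For instance, repetitive columns like those can be given explicit compression encodings when a Redshift table is created. Here’s a rough sketch; the table and column names are illustrative, and the encodings are just reasonable choices for the data shapes described above:

        CREATE TABLE crm_accounts (
            account_id   BIGINT,
            state        VARCHAR(2)   ENCODE bytedict,   -- few distinct values
            product_type VARCHAR(32)  ENCODE runlength,  -- long runs of repeated values
            card_name    VARCHAR(64)  ENCODE lzo         -- general-purpose compression
        );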

    Column orientation compression

    Redshift handles large scale column-oriented datasets using massively parallel processing (MPP), performing coordinated computations across a large number of processors in parallel, which makes it a fast and powerful data warehouse option.

    Data Warehouse Setup in the Data Age

    Even just a few years ago, getting a data warehouse and proper ETL processes in place was a long, painful and probably very expensive ordeal. But we’ve arrived in the data age where easy-to-use, affordable data solutions are bountiful.

    At Yhat, we use Redshift to warehouse everything: CRM data (we use SFDC), product data, site metrics from Google Analytics, and data from a bunch of other sources. It took us about 20 mins to set up the database on AWS, and it took us…wait for it…another 20 mins or so to set up all of our ETL using Fivetran, which we couldn’t be more impressed with.

    SQL IDE Done Right

    Most SQL IDEs of yesteryear leave something to be desired in terms of UX. The majority are clunky and have super old school Frankenstein UIs. Why they all focus on exploring the DB schema rather than on making it easy to write queries, view results, and think critically about your data has always been a mystery.

    Well those days are also over. Wagon is the query-focused SQL app I’ve been looking for for years. Wagon boasts a clean UX designed for analysts. Features are carefully chosen with a keen eye for usability for people writing tens or hundreds of queries per day. Wagon gets it in spades.

    Overview of Python UDFs in Redshift

    UDF stands for user-defined function, meaning that you can add functions to an environment (in this case, Redshift) in addition to those that come built in. Python UDFs allow you to combine the power of Redshift with what you know and love about the Python programming language without switching between IDEs or systems.

    The great thing about UDFs in Redshift is that Redshift will automatically execute them using its MPP architecture. One caveat to keep in mind is that your Python code still won’t execute as quickly as the native SQL functions (AVG, MIN, MAX, etc.) that are baked into the database.

    How to Use UDFs

    You can certainly work with text in pure SQL, but some tasks are just easier to do in a scripting language like Python instead. Here’s a toy example to illustrate how to use Python functionality within Redshift using a UDF.

    Suppose a column in one of our tables contains huge chunks of text or HTML, and we’re interested in finding any email addresses within any one record. Let’s write a function that takes in raw text and returns a pipe (|) separated string containing any email addresses found within the input text document. Define the function like so:
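    A minimal sketch (the function name and the email regex are illustrative choices, not the only way to do it):

        CREATE OR REPLACE FUNCTION f_extract_emails (raw_text VARCHAR(MAX))
        RETURNS VARCHAR(MAX)
        STABLE
        AS $$
            import re
            if raw_text is None:
                return None
            # simple email pattern; tighten as needed
            pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
            return '|'.join(re.findall(pattern, raw_text))
        $$ LANGUAGE plpythonu;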

    Once defined, you can use it like this:
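    For instance, against a hypothetical support_tickets table:

        SELECT ticket_id,
               f_extract_emails(body_html) AS emails_found
        FROM support_tickets
        WHERE f_extract_emails(body_html) <> '';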

    This is a scalar function, so it’ll return one record for each input row (i.e. it’s not an aggregate function). One thing to remember is that your UDFs are per-database, meaning that if you have multiple databases in your Redshift cluster, you’ll need to define your functions in each one.

    Example

    Redshift Python UDFs are based on Python 2.7 and come preloaded with a lot of our favorite libraries, including NumPy, SciPy and Pandas. You can also import custom modules from S3 and the web.

    Here’s the template published on the AWS blog that you can use to start creating your own scalar functions:
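    In outline (paraphrased here rather than quoted exactly), it looks like this:

        CREATE [ OR REPLACE ] FUNCTION f_function_name ( argument_name arg_type [, ...] )
        RETURNS data_type
        { VOLATILE | STABLE | IMMUTABLE }
        AS $$
            # your Python code goes here
            return result
        $$ LANGUAGE plpythonu;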

    The scalar UDFs that you create will return a single result value for each input value. Once you’ve defined a UDF, you can use it in any SQL statement.

    Helpful Resources

    To learn more about Python UDFs in Redshift, check out Amazon’s documentation, which is super helpful and covers everything from constraints to security and Python support. You can also check out the initial release blog post and a more extensive post that uses UDFs to analyze the CMS Open Payments data set.

    Yhat

    Yhat’s flagship product, ScienceOps, empowers teams of data scientists to deploy their models into web and mobile applications. These models are embedded into production applications via REST APIs without any recoding from their native statistical language.

    Calculating Active Users in SQL
    November 13, 2015 | Andy Granowitz

    How engaged are your users? How frequently do they visit your website or app? Analytics services like Google Analytics and Mixpanel calculate basic counts of daily, weekly, and monthly active users, but it’s difficult to customize or join these results with other data. Writing this query in SQL gives you more control. Let’s do it!

    Here’s a table of user logins by day. How many users were active in the last week and month?

    date       user_id   num_logins
    10/1/15    1         3
    10/1/15    2         null
    10/1/15    3         1
    10/2/15    1         null
    10/2/15    2         1
    10/2/15    3         3

    Like calculating running totals, there are two approaches: self join or window function.

    In either approach, it’s helpful to have a table of logins per user for each day, even if the user didn’t log in (null in this example). If your data isn’t already organized like this, you can generate a table with a row per day, per user, with the following query (this is Postgres syntax; for other databases, modify the generate_series function to generate a table of dates).
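    Here’s a sketch of that query, assuming a raw logins table with date, user_id, and num_logins columns (all names are illustrative):

        WITH dates AS (
            SELECT generate_series('2015-10-01'::date,
                                   '2015-10-31'::date,
                                   interval '1 day')::date AS date
        ), users AS (
            SELECT DISTINCT user_id FROM logins
        )
        SELECT d.date,
               u.user_id,
               l.num_logins
        FROM dates d
        CROSS JOIN users u
        LEFT JOIN logins l
          ON l.user_id = u.user_id
         AND l.date = d.date
        ORDER BY d.date, u.user_id;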

    To use this data, you can create a temporary table, use a common table expression, or include it as a subselect.

    Approach 1: Self Join

    A self join is when you join a table with itself. How meta is that? For each row, we ask how many logins that user had in the last week. The join condition requires the user_ids to match and the date to be within the last 7 days. The query then sums num_logins over those dates, and the case statement identifies the user as active on that day if she had any logins in the prior week.
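    Here’s a sketch, assuming the per-day, per-user table above is available as daily_logins:

        SELECT t1.date,
               t1.user_id,
               -- active if the user had any logins in the trailing 7 days
               CASE WHEN SUM(t2.num_logins) > 0 THEN 1 ELSE 0 END AS active_last_7_days
        FROM daily_logins t1
        LEFT JOIN daily_logins t2
          ON t1.user_id = t2.user_id
         AND t2.date BETWEEN t1.date - 6 AND t1.date
        GROUP BY t1.date, t1.user_id;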

    This query generates a table that tells us which users are seven-day-active over time. This result can be aggregated further, filtered for specific dates, used to find inactive users, and joined with other data. In Wagon, we can create a graph of the number of 7 day active users over time.

    Approach 2: Window Functions

    The self join works great, but modern databases have a more efficient way to get the same results. With window functions, we can explicitly aggregate only over the rows we care about in a single pass through the data. If you have millions or billions of rows (lucky you), the self join will take a long time to compute. The query sums num_logins over the user’s previous 14 days. It first partitions the table by user_id, then evaluates over a set of rows - in this case a specific date range. The case statement classifies the user as active or not, just as before.
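    A sketch of the window-function version, again assuming daily_logins; widening or narrowing the frame changes the metric:

        SELECT date,
               user_id,
               CASE WHEN SUM(num_logins) OVER (
                        PARTITION BY user_id
                        ORDER BY date
                        ROWS BETWEEN 13 PRECEDING AND CURRENT ROW   -- trailing 14 days
                    ) > 0
                    THEN 1 ELSE 0 END AS active_last_14_days
        FROM daily_logins;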

    This query makes it easier to add additional metrics for 7 and 30 day active users. As expected, the wider your definition of active user, the more you’ll have. Use these new powers carefully!


    Want to learn more SQL? Join us on Monday, November 16 at the Wagon office in San Francisco for a free SQL workshop. Please RSVP!

    Calculating Running Totals using SQL
    October 20, 2015 | Andy Granowitz

    How many users joined in the last 5 months? What were total sales in Q2? How much revenue came from the March sign up cohort?

    Although these questions can be answered with a single number, it can be useful to see a running total over time: how many unique users joined, or how much cumulative revenue was received by day over some period.

    Usually, data is stored incrementally. For example, here’s a table of sales per day:

    Date         Sales
    10/1/2015    5
    10/2/2015    3
    10/3/2015    7
    10/4/2015    8
    10/5/2015    2
    10/6/2015    3
    10/7/2015    6

    How do we generate the following table of cumulative sales over time? In SQL, there are two typical approaches: a self join or a window function.

    Date         Running Total of Sales
    10/1/2015    5
    10/2/2015    8
    10/3/2015    15
    10/4/2015    23
    10/5/2015    25
    10/6/2015    28
    10/7/2015    34

    A self join is a query that compares a table to itself. In this case, we’re comparing each date to any date less than or equal to it in order to calculate the running total. Concretely, we take the sum of sales in the second table over every row that has a date less than or equal to the date coming from the first table. This is Postgres/Redshift syntax, but other SQL dialects are very similar.
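    Here’s a sketch, assuming the table above is called sales with date and sales columns:

        SELECT t1.date,
               SUM(t2.sales) AS running_total
        FROM sales t1
        JOIN sales t2
          ON t2.date <= t1.date     -- every row on or before t1's date
        GROUP BY t1.date
        ORDER BY t1.date;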

    This is not a bad approach; it is a nice showcase of how extensible SQL can be using only select, from, join, and group by statements.

    But it is a lot of code for a simple task. Let’s try a window function. They are designed to calculate a metric over a set of rows. In our case, we want to sum every row where the date is less than or equal to the date in the current row.
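    Assuming the same sales table, a sketch looks like this:

        SELECT date,
               SUM(sales) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS running_total
        FROM sales
        ORDER BY date;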

    The window function can filter and arrange the set of rows to run the function over. Here, order by date rows unbounded preceding limits the sum function to sales on or before the date of the current row. Window functions are incredibly useful for time-based analytical queries; to learn more, the Postgres docs are a great place to start.

    The final step of creating a chart and sharing it triumphantly with your teammates is easily accomplished using Wagon. Window functions for the win!

    Wagon is a modern SQL editor for analysts and engineers: write queries, visualize data, and share charts with your team. Sign up for free!

    Migrating from SQL Workbench/J to Wagon
    September 30, 2015 | Andy Granowitz

    Many Wagon users previously used SQL Workbench/J to query Amazon Redshift. Older SQL tools are focused on DBA tasks like managing tables, updating schemas, and provisioning users. Analysts just want a simple way to query data, analyze it, visualize it, and collaborate with others. It’s no surprise that we’re frequently asked how to move from legacy tools like SQL Workbench/J to Wagon. It’s super easy.

    If you are currently using SQL Workbench/J and want to try Wagon, here are the quick steps to connect to Redshift in Wagon:

    1. In SQL Workbench/J, open the connection window
    2. Grab the hostname, port, and database from the URL, along with the username and password (in the Redshift interface, the URL is called the JDBC URL; see the example below)
    3. Paste into Wagon (no need to install any drivers!)
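    For reference, a Redshift JDBC URL looks something like this (the cluster name and database here are made up):

        jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev

        hostname: examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com
        port:     5439
        database: dev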

    SQL Workbench/J connect window


    Happy querying!

    Dear AWS Big Data Blog
    August 13, 2015 | Andy Granowitz

    We love Redshift and we love R. So we were delighted to see an AWS post about how to connect Redshift to R. Petabytes of data plus all the statistical models you can imagine? I’m in!

    Wagon is a great way to use Redshift.

    Unfortunately, their recommended setup instructions are awfully cumbersome, in large part due to the 12 unfriendly steps required to connect SQL Workbench/J. In Wagon, connecting to Redshift is one step and requires no complex configuration: just copy and paste your connection details. Now, you’re ready to run those same Redshift queries in Wagon. SQL Workbench/J no longer needed.

    While R offers an incredible collection of statistics and visualization libraries, it can often be more than you need for basic exploration and analysis. Many analysts find R to be overwhelming and unnecessary for most of their work. In fact, the flight-delays-by-month analysis in the blog post can be recreated in Wagon with one query and a quick drag and drop chart. You don’t need to be a command line wizard, just a little SQL curious!

    Hopefully this post saves you time when you’re interacting with Redshift in R, or even if you’re looking to run some more custom queries against your Redshift cluster. Sign up for early access to Wagon if you want to try it for yourself. Gogogo!

    Oh! And, we’re working on a deep integration with R. If you have strong opinions about it, join us at band.wagonhq.com

    Building an Analytics Pipeline in 2015
    August 06, 2015 | Andy Granowitz

    Every company needs an analytics data platform queryable by SQL.

    Using a single analytics tool or relying on logs from a single source is a fast way to get started but is rarely sufficient. You’ll realize you need a better data strategy when attempting more detailed analytics tasks: cohorting customers based on segments available in multiple data sources, analyzing long time scale trends, or making data available to other applications. Unfortunately, you’ll quickly reach the limit of your off-the-shelf tool.

    There has been a dramatic increase in data being created and fascination with Big Data, but less of a corresponding uptick in how to capture its value. Engineering a system to ingest, transform, and process data from many (changing, flaky) sources has long been a Very Hard Problem™. Doing this well requires hard work: the dreaded ETL.

    We see more and more companies choosing to invest in SQL warehouses and the requisite engineering well before they become large businesses. How do you effectively build one of these things? And what makes building robust analytics infrastructure difficult?


    Google Trends for Big Data vs. ETL (see the full Google Trends report)


    Here’s an example illustrating the core problems: you implemented a new purchase flow in your app and you’d like to understand conversion rates (tracked from logs) broken down by your email marketing A/B test (tracked from a 3rd party). The log lines you’re generating have new structure and may need to be re-parsed to fit into your existing schema. The A/B testing info may live in a different place than user data. Boilerplate reporting tools and drag and drop analytics UIs are great, but they require structuring ahead of time, and the new purchase flow is already live in production. Manually doing this analysis one time is annoying, but turning it into a reliable, repeatable practice is nearly impossible without dedicated engineering effort.

    Your goal should be to get your data into a data warehouse that can be queried directly by people and programs. While it’s not straightforward, it’s important to understand the pieces. We see companies addressing this problem by focusing on the following steps:

    1. For each data source: generate, collect, and store your data
    2. Transform data into usable, queryable form
    3. Copy multiple sources into a single place
    4. Enjoy the data fruits of your data labor

    The first step is collecting the data with as much structure as possible. You need to generate the data, transmit it from apps, browsers, or services for collection, and then safely store it for later. Many mobile and web analytics providers offer these three steps; others focus on a subset. For example, Heap and Mixpanel generate many app usage events automatically. Others focus on receiving data and making it available to read later (Keen and Splunk, as different examples). Segment takes advantage of the difficulty of logging to many places by transmitting data to many of the above services with one API call.

    Another large source of data is logs (usually messy and unstructured). Just having logs is not enough - they must be massaged into usable rows and columns. Some log lines help engineers analyze technology performance or debug errors, some log lines must be interpreted to signal “human” events, and some log lines have been around for so long that no one remembers why they’re there. Logs are rarely generated with their end purpose or a fixed type system in mind. Transformation of these raw strings is necessary to make them usable rather than just searchable.

    For example, you may need to combine three separate log lines in order to signal a successful-user-flow, or to compare log lines against prior data to understand if a user is new, active or re-activated. Or maybe you need to remove those extra pesky spaces around your beautiful integers or standardize timestamps across timezones. Trifacta, Paxata, and Tamr offer technical tools for transforming ugly log forms to structured rows and columns. Or you’ll roll your own.
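    As a tiny illustration of the kind of cleanup involved, here’s a sketch against a hypothetical raw_logs table (the column names are made up):

        SELECT
            CAST(TRIM(user_id_raw) AS INTEGER)   AS user_id,      -- strip stray spaces, fix the type
            CAST(event_time_raw AS TIMESTAMP)    AS event_time,   -- standardize timestamps
            LOWER(TRIM(event_name_raw))          AS event_name    -- normalize event names
        FROM raw_logs
        WHERE event_name_raw IS NOT NULL;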


    Dilbert Cartoon


    Once data collection systems are in place, you want to get this data flowing into a data warehouse. While some of the aforementioned tools provide their own interface for accessing collected and processed data, joining across multiple sources is difficult if not impossible, and their interfaces are often inflexible and cumbersome. Luckily, many of these services recognize this, and offer easy exports to data warehouses. Amplitude and Segment do this well and offer straightforward exports to Redshift. Google Analytics offers export to BigQuery for Premium customers (~$150k / year). Others make it possible, but require a bit of work (for example, Keen). New startups like Textur and Alooma are working on plumbing data into hosted RDBMS systems.

    Outside of dedicated analytics solutions, you often have data from third party sources you’d like to join and analyze (e.g. Salesforce, ZenDesk, MailChimp, etc.). Most of these tools offer APIs to extract data. Building and maintaining these connections from 3rd parties to your data warehouse on your own is doable, but this is often where data integration tools are helpful. Services like Snowplow and Fivetran help.

    At this point, data is flowing in a structured way. But where is it going? When assessing a data warehouse, look for:

    1. Large scale ingestion and storage
    2. Fast computation and multi-user concurrency
    3. Easy management and scaling with data
    4. Security and permissioning

    There are many that meet these criteria: Amazon Redshift, Google BigQuery, Microsoft Azure SQL Data Warehouse, Hive, Spark, Greenplum, Vertica, Impala (the list goes on!). The largest technology companies (Amazon, Google, Microsoft) are investing in, and subsidizing, these data warehousing solutions. It’s a crucial component to nearly every business, which naturally draws the attention of the tech titans.

    Big Data landscape diagram: it’s a data jungle out there. Diagram from 2014 by Matt Turck at FirstMark.



    Phew, now you can enjoy the freedom of querying structured data, and the work (and fun!) begins. We’ll have more on the data analysis step soon!

    We’d love to hear how you’re tackling these problems. What are your favorite tools? Where are your most painful pains? Tweet at @WagonHQ or send us a note at hello@wagonhq.com!

    Querying CSVs in Wagon
    June 08, 2015 | Andy Granowitz

    People use Wagon to query data stored in databases, but sometimes they need to analyze a file that doesn’t (yet) live in a database. Our friend Steve Pike recently showcased this hack in our #bandwagon Slack channel. He used a nifty command line tool called csvsql to automatically create a Postgres table from a CSV file and have it ready for Wagon.

    CSVs in Wagon

    Here’s how to set it up on Mac:

    1. Install Postgres
    2. Install csvkit
    3. Load your CSV file into Postgres:

       csvsql --db postgres://localhost:5432/postgres --insert --tables mytable /myfile.csv
      
      • mytable is the name of the new table
      • postgres is the name of the database, normally available by default
    4. Open Wagon and connect to your database: (Don’t have access to Wagon? Sign up for early access!)
      • Nickname: csvsql
      • Type: Postgres
      • Hostname: localhost
      • Port: 5432
      • Database: postgres
      • User: [your computer’s username]
      • Password: [empty]

    Woo! You can now write SQL against your CSV file using Wagon.
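    For example (using the mytable name from step 3; the column names are hypothetical and depend on your file):

        SELECT title, COUNT(*) AS num_scenes
        FROM mytable
        GROUP BY title
        ORDER BY num_scenes DESC
        LIMIT 10;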

    Try this out with your next CSV file or an example dataset of movie scenes filmed in San Francisco (source). Our team favorite Blue Jasmine isn’t number 1!

    Wagon is a great way to analyze databases and now small text files. Thanks csvsql.

    Need help with this quick hack? Email us at hello@wagonhq.com.

    Weekly Roundup: Great SQL Posts
    March 02, 2015 | Brant Gidwitz

    We’re always on the lookout for great articles about SQL, analytics, and visualization. Here’s a roundup of the team’s favorite recent posts. Enjoy!

    1. Embracing SQL In Postgres
    Rob Conery dives into Postgres' rich features including regex, full text search, series generation, date math, and window functions.

    2. Modern SQL in PostgreSQL [and other databases]
    Markus Winand presented a talk about lateral joins, common table expressions, with recursive, filters, and fetch first. Do you know about fetch first?!

    3. Tree structure query with PostgreSQL
    Learn how to use WITH RECURSIVE to query hierarchical data from Trường TX.

    4. Tinkering with Marcel (Pt. 1)
    Awesome adaptation of Tom Tango's Marcel Projection system for baseball projections implemented in PostgreSQL. Hats off to @mattdennewitz!

    These posts showcase SQL’s power and versatility. While the basics may be simple, it’s hard to master the language. We love seeing people carefully explain the complex topics. What are your favorite articles and blogs for learning SQL and exploring data?