Wagon Joins Box
August 31, 2016 | Team Wagon

We’re excited to share that the Wagon team will be joining the Box family! We are thrilled to bring our data analytics and insights expertise to the Box enterprise platform.


INSERT INTO "Box" VALUES ('Wagon');

Wagon and Box believe that shared knowledge empowers individuals and unites teams. Box helps enterprises do more with their content and get their work done faster by acting as the central, modern content platform. Fortune 500 companies across every industry trust Box to sit at the center of their businesses. We’re excited to build data analytics products at Box to help people understand their data and work better, together.

Our team will be focused on building new products at Box, so we’ll be shutting down the current Wagon product on October 3rd, 2016. Please read our transition FAQ for details.

We started Wagon over two years ago to help teams collaborate on data analysis. Tens of thousands of SQL-savvy people use Wagon to understand their data. We’re proud of what we’ve accomplished and are ready to continue building at Box.

We’re humbled by the support from our users, partners, investors, advisors, friends (#bandwagon), and family. Thank you for your feedback, bug reports, demands, and kind words.

Please read Box CEO Aaron Levie’s blog post.

We'll be rooting you all on. Gogogo!
Team Wagon

Your customer data, together at last
April 06, 2016 | Andy Granowitz

Customers interact with your company through many paths — they engage inside your product, open emails, view ads, answer surveys, etc. To understand your customers, you have to combine product engagement with their other interactions.

Unfortunately, it’s difficult to analyze customer data across these different experiences. Customer behavior in your product is usually measured through a 3rd party tracking service or by analyzing raw application logs. Other interactions, such as customer support chats or sales calls, are stored in SaaS apps or homegrown internal tools. You can only access this data through each tool’s limited reporting UI, exports, or possibly an API. There’s no magic wand for joining these disparate streams.

There are two common approaches to join these multiple datasets: (1) send all data to one tracking service or (2) move the data to a common place. Sending all the data to a single tracking service, like Google Analytics or Mixpanel, is cumbersome, nearly impossible to implement, and inflexible to query. Dead end.

The better approach is to move all customer data into a common store. Amazon Redshift, Google BigQuery, and Azure SQL Data Warehouse are our favorite cloud data stores for large scale analytics. They’re fast, easy to manage, and decreasing in price. But moving data from 3rd party sources to these cloud databases is still hard. You have to work with multiple APIs, find join keys, schedule data pulls, handle backfill, etc… It’s a huge headache.

There are two strategies: build or buy. Building your own is complicated - we have a post about it!


Today, thanks to Segment Sources, bringing all of your customer touch points together got easier. They’ve made it easy to pull customer data from many sources, such as Salesforce and Zendesk, along with behavioral data from your app or website, into a single warehouse. We worked with Segment to write some starter queries - give them a go. If you’re already using Wagon and connect a Segment Warehouse, you can open these queries right in the app.
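For a flavor of what becomes possible once these sources land in one warehouse, here’s a rough sketch of a query that counts product events and support tickets per customer. The schema, table, and column names (app.users, app.tracks, zendesk.tickets) are illustrative, not the actual starter queries; check your own Segment Warehouse before running anything like it.

    -- Count product events and support tickets per customer email.
    -- All schema/table/column names below are placeholders.
    SELECT
      u.email,
      COUNT(DISTINCT t.id) AS product_events,
      COUNT(DISTINCT z.id) AS support_tickets
    FROM app.users u
    LEFT JOIN app.tracks t ON t.user_id = u.id
    LEFT JOIN zendesk.tickets z ON z.requester_email = u.email
    GROUP BY u.email
    ORDER BY support_tickets DESC
    LIMIT 100;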

Segment will be expanding its Source catalog in the coming months, and you can always check out the current catalog to see the latest additions. We’re excited to partner with Segment on our shared mission to make answering questions with data easy.

We also recommend other tools to help unify your various customer data sets: Fivetran, RJ Metrics Pipeline, and Snowplow – so that you’ll always have access to the data you need in Wagon, no matter how you choose to set up your pipeline.

Modern ETL + modern SQL tools = win!

Deploying Electron
February 23, 2016 | Matt DeLand

Electron makes it easy to write cross-platform desktop apps using web technology — but how do you build and deploy these hybrid apps in production?

At January’s Bay Area Electron Meetup, we presented how Wagon builds and delivers new updates to our users. This talk covers both standard and custom Electron usage for background updating, uninterrupted usage, helper processes, cross-platform builds, OS notifications, and the occasional hotfix.


Thanks to Ben Gotow at Nylas for organizing!

Haskell for commercial software development
February 02, 2016 | Mike Craig

Inquiring minds on Quora want to know, is Haskell suitable for commercial software development?

I’m looking for an alternative to Python mainly for reasons of Python having so much trouble with workable concurrency, and Go seemed great at first look (it’s still not bad, but indeed error handling and lack of generics are a few features cut away too far).

This pretty much leaves Haskell as a workable alternative. But is it really a practical alternative?

My biggest fear is that while Haskell seems very powerful, it is also difficult, and so it may be too difficult for person A to read code written by person B (what seems to have killed Lisp, in my view, is that problem, and person-specific DSLs written in Lisp were simply unreadable for too many other people).

PS, in other words: I do not ask about strong sides of Haskell (or language X), those are well known, but rather about a lack of show-stoppers.

Absolutely, yes! At Wagon, we’ve built server- and client-side systems in Haskell, running in AWS and distributed to our users in a cross-platform desktop app.

Wagon on Quora

Possible show-stoppers to using Haskell commercially:

Community support
A good open-source community is a big deal. It’s the difference between fixing other people’s code and finding an open GitHub issue with a workaround and an in-progress fix. With Haskell, we find the latter much more frequently than the former. The community is positive, active, and generally helpful when problems come up. Lively discussion takes place on Reddit, IRC, a public Slack channel, and several mailing lists.

Hiring
There are many factors involved in hiring: team size, location, remote vs. on-site, other expertise required, etc. At Wagon we’re an on-site team of ~10 in San Francisco, with lots of web- and database-related problems to work on. Finding qualified Haskellers has not been an issue.

Haskell is different from the languages lots of developers are used to. ML-style syntax, strong types, and lazy evaluation make for a steep initial learning curve. Anecdotally, we’ve found Haskell’s secondary learning curve smooth and productive. Intermediate Haskellers quickly pick up advanced concepts on the job: parsers, monad transformers, higher-order types, profiling and optimization, etc.

Available industrial-strength libraries
Haskell lends itself to lightweight-but-powerful libraries that do one thing very well. We get a lot done by composing these small libraries rather than relying on big feature-complete frameworks. Some libraries have become de-facto standards that are stable and performant, and Stackage gives us a reliable way to find additional packages that usually “just work”.

We do sometimes find gaps in Haskell’s available libraries. For example, Haskell can’t yet match Python’s tools for numerical computing. And like every open-source ecosystem, there is some bad code out there. We rely on community recommendations and a sense of dependency hygiene to stay productive.

Performance
A quick Google search turns up benchmarks comparing Haskell to anything else you can imagine. In practice, we’re interested in tools with two properties:

  1. Fast enough under most circumstances. “Fast enough” means performance is never an issue for a given piece of code.
  2. Easy to optimize when needed. When we do need more speed, we’d like to get it incrementally rather than by immediately rewriting that code in C.

Haskell has both of these properties. We almost never have to address performance in our Haskell code. When we do, we turn to GHC’s profiler and its many performance knobs: optimization flags, inlining, strictness and memory layout controls, etc. Haskell also has good interoperability with C for when that’s appropriate.

Memory leaks
There’s a lot of talk about memory leaks caused by Haskell’s lazy evaluation model. We run into this very infrequently in practice, and in those rare situations GHC’s profiler has led us to solutions quickly. The real problem is not lazy evaluation but lazy IO, which we avoid with tools like conduit and pipes.

Debuggability
Haskell doesn’t have a debugger in the same sense as Python or Java. This is mostly a non-issue, because exceptions are rare and GHCi gives us a flexible way to run code interactively. Nonetheless, hunting down problems in a big application can be difficult. This has improved in recent versions of GHC with support for proper stack traces.

Haskellers reading other Haskellers’ code
This hasn’t been a problem for us. Haskell is a flexible language both in syntax and semantics, but this leads to better solutions more than it leads to opaque or unreadable code. Heavy indirection—via embedded DSLs or deep typeclass hierarchies—is unusual in practice. We reach for those tools as a last resort, knowing that they come with a serious maintenance cost. Agreeing on a basic style guide helps smooth out minor syntax differences.

Read our other Haskell engineering blog posts, come to our community events, or better yet, join our team!

Python UDFs in Amazon Redshift
February 01, 2016 | Elise Breda (Yhat)

Elise Breda is the Growth Strategist at Yhat, a tech company that builds software for deploying predictive analytics to mobile and web apps. When she isn’t at their awesome new office in DUMBO, she can be found exploring the bike paths and Thai restaurants of NYC.

In the Beginning…

In the early 2000s, the behemoth collection of cloud computing services we now know as Amazon Web Services (AWS) was little more than sparks firing in Chris Pinkham and Benjamin Black’s neurons. In 2003, the two presented a paper (blog post here) outlining a radical vision for a retail computing infrastructure that would be standardized, automated and built upon web services. The next year, Simple Queue Service, the first AWS service for public usage, was launched.

Fast forward almost a decade, and AWS is now the most commonly used cloud platform among enterprise software developers. AWS products run the gamut of web services, from computation (e.g. EC2) to networking (e.g. VPC) and storage (e.g. S3). In this post we’ll explore a small fraction of a fraction of the AWS ecosystem–a database that’s generating all kinds of groundswell right now: Amazon Redshift.

The rest of this post will talk about Redshift at a high level and then dive into a mini overview of User Defined Functions (UDFs), how they work, why they’re great, and how to start using them.

Amazon Redshift

Amazon Redshift is a hosted data warehouse that’s easy to set up, built for speed, and suitable for a variety of combining, storing, and compute-heavy analytics tasks.

Two things make Redshift particularly attractive. First, Redshift can handle insane amounts of data–it is a petabyte-scale warehouse. A petabyte is a lot (10^15 bytes) of data. As a point of reference, the entire master catalog of Netflix video in 2013 amounted to about 3.14 petabytes of storage space (interesting read on Quora). Second, unlike Amazon’s other hosted database product, Amazon RDS, Redshift stores data in a column-oriented structure. Column orientation is good for tables containing columns with lots of repeated values (e.g. credit card names, county/state, product type), like CRM data. Because the values in a column are uniform, there are opportunities for storage size optimization via compression. You can read more about how to maximize compression here.

(Figure: column orientation and compression)
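As a quick, hypothetical illustration of that point, Redshift can recommend an encoding for each column of an existing table, or you can declare encodings yourself when creating one (the table and column names below are made up):

    -- Ask Redshift to recommend a compression encoding per column.
    -- crm_contacts is a placeholder table name.
    ANALYZE COMPRESSION crm_contacts;

    -- Or declare encodings explicitly; BYTEDICT suits low-cardinality
    -- text columns with lots of repeated values.
    CREATE TABLE crm_contacts (
      contact_id   BIGINT      ENCODE lzo,
      product_type VARCHAR(64) ENCODE bytedict,
      state        CHAR(2)     ENCODE bytedict
    );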

Redshift handles large scale column-oriented datasets using massively parallel processing (MPP), performing coordinated computations across a large number of processors in parallel, which makes it a fast and powerful data warehouse option.

Data Warehouse Setup in the Data Age

Even just a few years ago, getting a data warehouse and proper ETL processes in place was a long, painful and probably very expensive ordeal. But we’ve arrived in the data age where easy-to-use, affordable data solutions are bountiful.

At Yhat, we use Redshift to warehouse everything–CRM data (we use SFDC), product data, site metrics from Google Analytics, and data from a bunch of other sources. It took us about 20 mins to set up the database on AWS, and it took us…wait for it…another 20 mins or so to set up all of our ETL using Fivetran, which we couldn’t be more impressed with.

SQL IDE Done Right

Most SQL IDEs of yesteryear leave something to be desired in terms of UX. The majority are clunky and have super old school Frankenstein UIs. Why they all focus on exploring the DB schema rather than on making it easy to write queries, view results, and think critically about your data has always been a mystery.

Well, those days are also over. Wagon is the query-focused SQL app I’ve been looking for for years. Wagon boasts a clean UX designed for analysts. Features are carefully chosen with a keen eye for usability for people writing tens or hundreds of queries per day. Wagon gets it in spades.

Overview of Python UDFs in Redshift

UDF stands for user-defined function, meaning that you can add functions to an environment (in this case, Redshift) in addition to those that come built in. Python UDFs allow you to combine the power of Redshift with what you know and love about the Python programming language, without switching between IDEs or systems.

The great thing about UDFs in Redshift is that Redshift executes them automatically using its MPP architecture. One caveat to keep in mind is that your Python code still won’t execute as quickly as the native SQL functions (AVG, MIN, MAX, etc.) that are baked into the database.

How to Use UDFs

You can certainly work with text in pure SQL, but some tasks are just easier to do in a scripting language like Python. Here’s a toy example to illustrate how to use Python functionality within Redshift using a UDF.

Suppose a column in one of our tables contains huge chunks of text or HTML, and we’re interested in finding any email addresses within a given record. Let’s write a function that takes in raw text and returns a pipe-separated (|) string containing any email addresses found within the input document. Define the function like so:
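A minimal sketch of such a UDF, with the function name f_extract_emails and the regex chosen purely for illustration rather than taken from the original example:

    CREATE OR REPLACE FUNCTION f_extract_emails (doc VARCHAR(MAX))
    RETURNS VARCHAR(MAX)
    STABLE
    AS $$
        import re
        if doc is None:
            return None
        # Match anything that looks like an email address.
        matches = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', doc)
        # Join matches with a pipe, or return NULL if none were found.
        return '|'.join(matches) if matches else None
    $$ LANGUAGE plpythonu;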

Once defined, you can use it like this:
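For example, against a hypothetical support_tickets table with a free-text body column:

    -- support_tickets and body are placeholder names.
    SELECT
      id,
      f_extract_emails(body) AS emails_found
    FROM support_tickets
    WHERE f_extract_emails(body) IS NOT NULL
    LIMIT 100;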

This is a scalar function, so it’ll return one result value for each input row (i.e. it’s not an aggregate function). One thing to remember is that UDFs are per-database, meaning that if you have multiple databases in your Redshift cluster, you’ll need to define your functions in each one.

Example

Redshift Python UDFs are based on Python 2.7 and come preloaded with a lot of our favorite libraries, including NumPy, SciPy and Pandas. You can also import custom modules from S3 and the web.
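Custom modules are pulled in with CREATE LIBRARY; in this sketch the package name, bucket, path, and credentials are all placeholders:

    -- Make a custom Python package (zipped and uploaded to S3)
    -- importable from your UDFs.
    CREATE OR REPLACE LIBRARY my_module
    LANGUAGE plpythonu
    FROM 's3://my-bucket/python-libs/my_module.zip'
    CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>';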

Here’s the template published on the AWS blog that you can use to start creating your own scalar functions:
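Paraphrased rather than copied verbatim (the function name, arguments, and body here are placeholders), the general shape is:

    -- Declare SQL argument and return types, then write ordinary
    -- Python 2.7 in the body between the $$ delimiters.
    CREATE OR REPLACE FUNCTION f_my_function (arg1 INTEGER, arg2 VARCHAR(MAX))
    RETURNS VARCHAR(MAX)
    STABLE
    AS $$
        # Import preloaded libraries (e.g. pandas) if you need them,
        # compute, and return one value matching the RETURNS type.
        return '{}:{}'.format(arg1, arg2 or '')
    $$ LANGUAGE plpythonu;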

The scalar UDFs that you create will return a single result value for each input value. Once you’ve defined a UDF, you can use it in any SQL statement.

Helpful Resources

To learn more about Python UDFs in Redshift, check out Amazon’s documentation, which is super helpful and covers everything from constraints to security and Python support. You can also check out the initial release blog post and a more extensive post that uses UDFs to analyze the CMS Open Payments data set.

Yhat

Yhat’s flagship product, ScienceOps, empowers teams of data scientists to deploy their models into web and mobile applications. These models are embedded into production applications via REST APIs without any recoding from their native statistical language.