BigQuery Mirrors
BigQuery is a fully managed enterprise data warehouse platform from Google that helps you manage and analyze your Reach data with built-in features like machine learning, geospatial analysis, and business intelligence. Your data team may already be familiar with BigQuery from its various uses around our movement space including powering the DNC’s “Phoenix” analytics platform and powering Community Tech Alliance’s “Progressive Action Database”.
BigQuery also offers a number of additional features that can help you to get more value from your Reach data including:
- Powerful querying language: BigQuery supports a powerful querying language that allows you to analyze your data in a variety of ways.
- Data visualization tools: BigQuery offers a number of data visualization tools that can help you to create interactive reports and dashboards.
- Integration with other Google Cloud Platform services: BigQuery can be integrated with other Google Cloud Platform services, such as Google Analytics and Google Cloud Dataflow.
- Code checking that can direct you to where there is an error.
Costs
One BigQuery Mirror setup is included in your Reach Movement subscription for no added cost. BigQuery Mirrors are only available to our Reach Movement subscribers and are especially well-suited to our larger organizations with their own tech teams.
Be careful, though! Querying data and other operations in BigQuery are billed by Google to your Google Cloud billing account. These costs can add up quickly, so make sure you understand the queries you’re running! See “Tips for Using this Data” below for some cost-saving suggestions.
Format and Data Schemas
Your BigQuery Mirror will be in the form of a dataset which will include 32 views of tables directly mirroring the Reach database, exposing all your data in a raw data-engineer ready relational database format. You can see a data dictionary here. To stay up to date on changes we make to the data schema over time, please sign up for email notifications here and keep an eye on that page.
How it Works
- Accessing Your Files: You will have access to a BigQuery dataset that you can access and query like any other BigQuery data. You may want to setup a recurring job to query the views and copy them into tables in your own project.
- Export Frequency: Data is close to real-time, with updates every 15 minutes or so.
- Data Format: Almost all possible Reach data will be available to you in your BigQuery mirror. Our data dictionary explains what will be available in greater depth. Most tables will have a
campaign_id
column to distinguish between campaigns and if you are receiving data in a BigQuery mirror spanning multiple members/affiliates/partners/down-ballot campaigns, you will need to divide the data up by campaign ID into other views or tables to give your individual stakeholders access to only the data they should receive. Tables without acampaign_id
include:pipelines
: all campaigns share access to the same set of pipelines so no permissions are necessary and all rows should be surfaced to all end users.social_platforms
: all campaigns share access to the same set of social_platforms so no permissions are necessary and all rows should be surfaced to all end users.users
: Each distinct Reach user only gets one row in theusers
table no matter how many campaigns they are a member of. Yourusers
view from us will include a row for every user who is a member of any of your campaigns. You should use theuser_to_campaign_associations
view to check which users are in each campaign and join from there back tousers
to give stakeholders access to only theusers
they ought to be able to see.user_devices
: Similar tousers
, please use theuser_to_campaign_associations
view to check which users are in each campaign and join from there back touser_devices
to give stakeholders access to only the devices they ought to be able to see.
- Data Expiration: You will have access to your data as long as your Reach subscription is active.
BigQuery for people using Phoenix or Progressive Action Database
If you are using State Party/DNC data, your data will be available directly in Phoenix, but you will need to contact your state party to inquire about direct access to Phoenix and to these Reach tables specifically. Phoenix users can learn more about the Reach data flow back to Phoenix here.
If you are using PAD, please contact the folks at Community Tech Alliance at [email protected] and let them know you are interested in having them sync your Reach data in PAD.
BigQuery Mirrors vs Scheduled Exports
There are many advantages to using BigQuery over our existing scheduled exports which we’ve outline below:
- Data is close to real-time with upgrades roughly every 15 minutes as opposed to scheduled exports which are typically delivered once a day.
- With scheduled exports you’d be getting CSV files that share the same condensed schema and limited column layout of our manual exports, but with the BigQuery Mirror you’ll be getting a much more complete view of the data with almost every column from our actual app database with raw values that are designed specifically to be used by data folks in a relational database environment. See our BW data dictionary here.
- Scheduled Exports only include 10 tables while the BQ Mirror will include over 30.
- You don’t have to write any code to merge incremental CSVs into your database and we don’t have to worry about any data slipping through the cracks of the daily incremental exports.
How to Get Started
If you’re a Movement client, email us to set up your BigQuery Mirrors. We will need to know which external ID you want the mirror to use (determined by which voter file it will be joined to i.e. VAN ID), and what Google accounts you want to have access to it.
If you’re not yet a Movement client, email us to let us know you are interested in BigQuery and would like to upgrade to Movement. Once you’ve upgraded to Movement we can begin the process to get you set up on BigQuery by following the steps above.
Tips for using this data
Because we are giving you access to all this data in the form of views, you will find that you don’t have quite as much flexibility with how you use it as you do with other data you have in BQ datasets. Some things to keep in mind with these views are:
- They exist in a specific BQ region (us-east4) so the views cannot be part of joins directly with data in a different region (for example, DNC Phoenix data).
- They are in Reach’s project, not your own.
- You will only have “Data Viewer” permissions on the dataset.
So, many of our clients who want a bit more flexibility with the data, choose to select from our views and periodically copy all the data from them into their own dataset in their own project. This overcomes the limitations above and we recommend this flow in many cases.
There are many ways to accomplish this, but if you would like to use a scheduled python script, we can offer the following two examples to get you started:
- This script was written by Brittany Bennett and Austin Wesigrau at the Working Families Party. It uses Parsons and works well if your destination dataset is in the us-east4 region like the Reach dataset is.
- This script was written by Ari Ghasemian at the Wisconsin Democratic Party and uses an in-memory pandas dataframe to move the data to a dataset in a different region.
Whatever method you use, please make sure you are dynamically querying the table names and column names so that added tables and columns are not going to be breaking changes which cause errors in your script. We will not typically provide advance notice of non-breaking changes like table/column adds.