Google BigQuery is an industry-leading, fully-managed cloud data warehouse that allows you to store and analyze petabytes of data in no time.
RudderStack allows you to dump your customer event data into Google BigQuery by adding it as a destination to the source of your choice.
Follow these steps to set up Google BigQuery before adding it as a destination in RudderStack:
Create a Google Cloud Platform (GCP) project if you do not have an existing one. More details can be found here.
Enable the BigQuery API if you have an existing project and it is not enabled. More details can be found here.
Log into the Google BigQuery console. Copy the project ID as shown in the following image. You will require this project ID while configuring BigQuery as a destination in the RudderStack dashboard.
Create a new Google Cloud Storage (GCS) bucket, or provide an existing one to store files before loading data into your BigQuery.
Create a new service account under IAM & Admin > Service Accounts
Add Service Account permissions as follows:
Storage Object Creator and
Storage Object Viewer roles to the account
BigQuery Job User and
BigQuery Data Owner roles to the account. Note: If a dataset with the name (configurable by the setting: namespace in destination settings) already exists, role of
BigQuery Data Editor would suffice instead of BigQuery Data Owner
Create a key for the service account with JSON as the type and store it.
Create and download the private JSON key which will be required while configuring BigQuery as a destination in RudderStack, as shown:
You will need to whitelist the RudderStack IPs to enable network access to it.
In order to enable sending data to Google BigQuery, you will first need to add it as a destination to the source from which you are sending event data. Once the destination is enabled, events from RudderStack will start to flow to BigQuery.
To do so, please follow these steps:
Choose a source to which you would like to add Google BigQuery as a destination. You can also simply create the destination and connect it to a source later.
Once you have set up the source, select Google BigQuery from the list of destinations. Give your destination a name, and then click on Next. You will then see the following Connection Credentials screen:
Add your project ID and bucket name, as specified in the Setting up Google BigQuery section
Copy the contents of the credentials JSON file you created and stored, in the section above.
That's it! You have successfully added Google BigQuery as a destination in RudderStack. Once you start sending events from your source, RudderStack will dump them into the BigQuery bucket periodically.
The source name (written in snake case, e.g.
source_name) is used by RudderStack to create a dataset in BigQuery. More details about the tables and columns created can be found here.
RudderStack creates ingestion-time partition tables based on the load date, so that users can take advantage of it to query a subset of data. More details about BigQuery partitioned tables here and how we create it on load can be found here.
In addition to tables, a view (
<table_name>_view) is created for every table for de-duplication purposes. We recommend that you use the corresponding view (contains events from last 60 days) to avoid duplicate events in query results. Since BigQuery views are merely logical views and are not cached, you can create a native table from it to save money by avoiding running the query that defines the view each time.
There are some limitations when it comes to using reserved words in a schema, table, or column names. If such words are used in event names, traits or properties, they will be prefixed with a
_when RudderStack creates tables or columns for them in your schema.
Besides, integers are not allowed at the start of the schema or table name. Hence, such schema, column or table names will be prefixed with a
'25dollarpurchase' will be changed to