You will require this project ID while configuring BigQuery as a destination in RudderStack.
Ensure that billing is enabled for the project to enable RudderStack to load data into the BigQuery cluster.
Create a new Google Cloud Storage (GCS) bucket or provide an existing one to store files before loading data into your BigQuery.
To ensure loading data from cloud storage to BigQuery, make sure to co-locate your GCS storage bucket with BigQuery. More information can be found here.
Setting up the Service Account for RudderStack
Create a new service account under IAM & Admin>Service Accounts
Add Service Account permissions as specified below:
Add the Storage Object Creator and Storage Object Viewer roles to the account.
Add the BigQuery Job User and BigQuery Data Owner roles to the account.
If a dataset with the name (configurable by the setting: Namespace in the RudderStack destination settings) already exists, the role of BigQuery Data Editor would suffice instead of BigQuery Data Owner.
Create a key for the service account with JSON as the type and store it.
Service Account Permissions
Create and download the private JSON key which will be required while configuring BigQuery as a destination in RudderStack, as shown:
JSON key required for the RudderStack UI
Make sure that you create the service account in the same project as BigQuery.
Setting Up the Network Access
If necessary, you can whitelist the RudderStack IPs to enable network access to it.
The IPs to be whitelisted are : 188.8.131.52, 184.108.40.206 , 220.127.116.11 , 18.104.22.168, and 22.214.171.124.
Configuring Google BigQuery in RudderStack
To enable sending data to Google BigQuery, you will first need to add it as a destination in RudderStack. Once the destination is configured and enabled, events from RudderStack will start to flow to BigQuery.
To do so, please follow these steps:
Choose a source to which you would like to add Google BigQuery as a destination.
Once you have set up the source, select Google BigQuery from the list of destinations. Give your destination a name, and then click on Next. You will then see the following Connection Credentials screen:
Google BigQuery Configuration Settings in RudderStack
Copy the contents of the credentials JSON file you created and stored, in the section above.
That's it! You have successfully added Google BigQuery as a destination in RudderStack.
Schema, Partitioned Tables, and Views
RudderStack uses the source name (written in snake case, e.g. source_name) to create a dataset in BigQuery.
More details about the tables and columns created by RudderStack can be found here.
RudderStack creates ingestion-time partition tables based on the load date, so you can take advantage of it to query a subset of data.
More details about the BigQuery partitioned tables here. Information on RudderStack creates these tables on load can be found here.
In addition to tables, a view (<table_name>_view) is created for every table for de-duplication purposes.
More information on views can be found in this guide.
We recommend that you use the corresponding view (which contains the events from the last 60 days) to avoid duplicate events in your query results. Since BigQuery viewsare merely logical views and are not cached, you can create a native table from it to save money by avoiding running the query that defines the view every time.
Users can modify the view query to change the time window of the view. The default value is set to 60 days.
How are reserved words handled by RudderStack?
There are some limitations when it comes to using reserved words in a schema, table, or column names. If such words are used in event names, traits or properties, they will be prefixed with a _when RudderStack creates tables or columns for them in your schema.
Besides, integers are not allowed at the start of the schema or table name. Hence, such schema, column or table names will be prefixed with a _.
For instance, '25dollarpurchase' will be changed to '_25dollarpurchase'.
When sending data into a data warehouse, how can I change the table where this data is sent?
By default, RudderStack sends the data to the table / dataset based on the source it is connected to. For example, if the source is Google Tag Manager, RudderStack sets the schema name as gtm_*.
You can override this by setting the Namespace in the destination settings as shown:
I'm looking to send data to BigQuery through RudderStack and I'm trying to understand what data is populated in each column. How do I go about this?
Make sure that both your BigQuery dataset and the bucket have the same region.
When piping data to a BigQuery destination, I can set the bucket but not a folder within the bucket. Is there a way to put Rudderstack data in a specific bucket folder?
Yes, you can set the desired folder name in the Prefix input field while configuring your BigQuery destination.
Does open-source RudderStack support near-realtime syncing to BigQuery and Event Replay?
The near-realtime BigQuery syncing feature is currently under development and is planned to be released in the coming months. Unfortunately, Event Replay is not a part of open-source RudderStack currently.
What is the current sync frequency for BigQuery?
If you're using open-source RudderStack, the minimum sync frequency is 30 minutes. If you're self-hosting the data plane or using RudderStack Cloud Pro / Enterprise, you can tweak the config to set uploadFreqInS in config.yamlto 0 and determine the best possible value for near real-time sync.