Serverless Data Analysis allows organizations to carry out no-ops data warehousing using BigQuery, and pipeline processing using Cloud Dataflow.
Google BigQuery is a petabyte-scale data warehouse on Google Cloud that you interact with primarily through SQL, and Cloud Dataflow is a data processing pipeline system that you can program against in either Python or Java.
Serverless Data Analysis is meant for people who build data pipelines and data analytics. Anyone working with these tools needs a solid understanding of SQL to interact with BigQuery, and either Python or Java to work with Dataflow.
BigQuery is Google’s no-ops solution for data warehousing and analytics; no-ops in this context means that there is no infrastructure for you to manage, and therefore no operations. It lets you store, analyze, and export data from a centralized location.
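As a concrete illustration, interacting with BigQuery is largely a matter of composing a Standard SQL query and submitting it to the service. A minimal sketch in Python follows; the project, dataset, table, and column names are hypothetical, and the actual submission (which requires the `google-cloud-bigquery` client library and credentials) is shown only in comments:

```python
# Sketch of a BigQuery interaction from Python.
# The project/dataset/table names below are hypothetical examples.

def build_daily_count_query(table: str, date_column: str) -> str:
    """Compose a Standard SQL query that counts rows per day."""
    return (
        f"SELECT DATE({date_column}) AS day, COUNT(*) AS n "
        f"FROM `{table}` "
        f"GROUP BY day "
        f"ORDER BY day"
    )

query = build_daily_count_query("my-project.my_dataset.events", "created_at")

# With the google-cloud-bigquery client library installed and
# credentials configured, the query would be submitted like this:
#
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   for row in client.query(query):
#       print(row.day, row.n)

print(query)
```

Because BigQuery is no-ops, nothing beyond the query itself is needed: there are no clusters to size or servers to provision before running it.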
Dataflow is a way to execute Apache Beam data processing pipelines on the cloud. A pipeline runs as a series of steps, called transforms, and the key feature of Dataflow is that these transforms can be elastically scaled. The pipeline code is written against an open source API called Apache Beam, and Dataflow is not the only place you can execute Beam pipelines: you can also run them on Flink, Spark, and other back-ends. Cloud Dataflow, however, is the usual execution service when you want to run a data pipeline on Google Cloud.
Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
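The pipeline model described above can be sketched in plain Python. To be clear, this is not the real Beam SDK (that would be the `apache_beam` package); it is a conceptual stand-in showing the core idea: a pipeline is a chain of transforms over a collection, defined independently of whichever back-end (Dataflow, Flink, Spark) ultimately executes it:

```python
# Conceptual sketch of the Beam pipeline model in plain Python.
# The real SDK is the apache_beam package; this stand-in only
# illustrates chaining transforms over a collection.

class Pipeline:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        """Element-wise transform (analogous to Beam's Map/ParDo)."""
        return Pipeline(fn(x) for x in self.data)

    def filter(self, pred):
        """Keep elements matching a predicate (analogous to Beam's Filter)."""
        return Pipeline(x for x in self.data if pred(x))

    def run(self):
        """Hand the pipeline to a runner; here we simply return results."""
        return self.data

# Each chained step corresponds to one transform that a real runner
# such as Cloud Dataflow could scale elastically across workers.
result = (
    Pipeline(["alpha", "beta", "gamma", "delta"])
    .filter(lambda w: w.startswith(("a", "d")))
    .map(str.upper)
    .run()
)
print(result)  # ['ALPHA', 'DELTA']
```

Because the pipeline definition is separate from its execution, the same chain of transforms runs unchanged whether the runner is a local process (as here), Flink, Spark, or Cloud Dataflow.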