EventDB/AudienceDB Best Practices for Writing Queries

EventDB/AudienceDB Best Practices for Writing Queries

by on ‎10-11-2018 02:35 PM - edited 3 weeks ago by Community Manager (155 Views)

This article provides best practices for writing queries on EventDB and AudienceDB.

In this article:

Table of Contents Placeholder

Query Best Practices

This section provides an example of a "bad" query and describes how the query impacts results, and then provides best practice suggestions on how to create a "good" query with more efficient, meaningful results.

Example of a "Bad" Query

The following example shows a "bad" query, wherein the results are not as useful as they could be:

SELECT * FROM events_view__all_events__all_events

This query can impact your results in the following ways:

  • Inefficient – This query does not include a where clause filter which will result in the entire table being scanned and returning much more data than is needed.
  • Performance – Selecting all of the columns in a schema is an expensive operation, especially when using columnar databases such as Redshift. This query will take longer than necessary since, in most cases, all columns are not needed in the result set.

Best Practices for a "Good" Query

The following list suggests best practices for meaningful query results and provides code examples for each:

  • Select Specific Columns
    Only select the columns that contain the data you are interested in analyzing:
    SELECT "event - udo - referrer"
    FROM events_view__all_events__all_events
  • Predicate Filter (WHERE Clause)
    Wherever possible, use predicate column filters to reduce the subset of data to be queried.
    SELECT "event - udo - referrer"
    FROM events_view__all_events__all_events
    WHERE "event - udo - referrer" = 'google.com'
    AND "event - udo - page_name" = ‘Home’
  • Specify a Date Range
    Filter on date columns to limit the amount of data scanned.
    SELECT "event - udo - referrer"
    FROM events_view__all_events__all_events
    WHERE "event - udo - referrer" = 'google.com'
    AND "event - time" BETWEEN ‘2018-07-01’ and ‘2018-07-31’

About Redshift Spectrum

EventDB and AudienceDB are hosted on Amazon Redshift and Amazon Redshift Spectrum. This section describes issues that pertain only to Redshift Spectrum databases.

How do I know if I am on Redshift Spectrum or standard Redshift?

Redshift Spectrum users have this additional schema, in addition to the standard account__profile schema. 

  • account__profile__spectrum

How does using Redshift Spectrum impact my data and query performance?

Redshift Spectrum users connect to their Spectrum database in the same way that standard Redshift users connect to their databases. The differences are as follows:

  • Redshift Spectrum user data is stored in S3 instead of within a physical Redshift cluster disk.
  • Redshift Spectrum allows the customer to store and query large amounts of data in comparison to traditional Redshift database user storage.
  • Some users, depending on their data size and query requirements, are hosted on Redshift Spectrum as opposed to standard Redshift.
  • Redshift Spectrum table schemas have additional columns that are referred to as partition columns.

How do I use partition column predicate filters?

Partition columns, when queried appropriately, can vastly accelerate query performance when performing large scans on Redshift Spectrum databases.

The following table provides query examples with partition column predicate filters for AudienceDB and EventDB:

Environment Example

AudienceDB

SELECT col1, col2 
FROM account__profile.visitors_view
WHERE "visit - day" >= ‘2018-01-01’
SELECT col1, col2 
FROM account__profile.visits_view
WHERE "visit - day" >= ‘2018-01-01’

EventDB

SELECT col1, col2 
FROM account__profile.events_view__all_events__all_events
WHERE "event - day" >= ‘2018-01-01’

Additional Vendor Information

For additional vendor information about query best practices, see the following AWS documentation: