Data Hacking: Coding up a Recommendation Engine from Simple Playlist Data

 Data Hacking: Coding up a Recommendation Engine from Simple Playlist Data

I love Pandora. Type in an artist’s name and it starts playing similar stuff. Pandora’s recommendation engine feels like magic.

BigQuery provides a sample data set of some playlist data (Google’s @felipehoffa says the original data set was created by @apassant, awesome data!). The data is very simple: a single row for each track in the playlist. Track data contains playlist_id, artist (id and name), album (id and title) and track (id and title).

Using this simple data, we built a recommendation engine in Looker’s LookML that takes an artist, finds the most related artists and then recommends a playlist, all in about 300 lines of LookML.
View the code on Github

 First, Go Ahead, Play With It

Change the filter to your favorite artist and based on this data, we’ll recommend some songs.
Click on the artist or song title to play music.

 Step 1: Build out a Simple LookML Model

The table that we’re working with is structured like this:

 Table: Playlists

To build our LookML views and model, we need to build out a LookML dimension for each field in the table. Then we label each ‘object’ (things that would be in their own table in a de-normalized schema). For example, tracks.data.artist.id becomes ‘artist_id’.

  - dimension: artist_id
    view_label: Artist
    type: int
    sql: ${TABLE}.tracks.data.artist.id
    fanout_on: tracks.data

For each object, we also build a count. To count artists, we want to count the distinct values of artist_id. When drilling into an artist count, we want the artist’s id, name and the other counts.

  - measure: artist_count
    type: count_distinct
    sql: ${artist_id}
    drill_fields: [artist_id, artist_name, count, track_count,
      track_instance_count, album_count]

For grins, we build some linkage of artists to external sites so we can see their Twitter, Facebook, Wikipedia and YouTube pages, if they have them.

  - dimension: artist_name
    links:
    - label: YouTube
      url: http://www.google.com/search?q=site:youtube.com+{{value}}&btnI
      icon_url: http://youtube.com/favicon.ico
    - label: Wikipedia
      url: http://www.google.com/search?q=site:wikipedia.com+{{value}}&btnI
      icon_url: https://en.wikipedia.org/static/favicon/wikipedia.ico
    - label: Twitter
      url: http://www.google.com/search?q=site:twitter.com+{{value}}&btnI
      icon_url: https://abs.twimg.com/favicons/favicon.ico
    - label: Facebook
      url: http://www.google.com/search?q=site:facebook.com+{{value}}&btnI
      icon_url: https://static.xx.fbcdn.net/rsrc.php/yl/r/H3nktOa7ZMg.ico

See the complete LookML view file

See the complete LookML model file

 Learn More About the Data Set

Looks like there are about 500K playlists, with a total of about 12M tracks. There are 92K-ish different artists, with about 900K individual songs. Click on Explore Data then click on any of the numbers to drill into the data further.

Playlists Album Count Playlists Artist Count Playlists Count Playlists Track Count Playlists Track Instance Count
213,310 92,630 504,169 901,642 12,138,977

Explore From Here

 Who is the Top Artist (in this data set)?

Of course, it depends on how you count it. Which artist has the most instances of songs on playlists? Looks like Linkin Park. The really fun part of this is that after clicking Explore Data, clicking any number, takes you to the album, track or artist.

 Top Artists in Data Set

Explore From Here

Ranking is a great tool for building up knowledge about particular fields in a data set. The “Top 40” in a given week has long been a way of rating music.

We are going to rank tracks (songs) in their overall popularity (against all songs) and their popularity within an artist. We’d like to end up with a table like:

track_id artist_id overall_rank artist_rank

We can do this with a relatively simple 2-level query. The first level groups by track_id and artist_id and counts the number of playlists the song appears on. The second level (using window functions), calculates the overall rank of the song and the rank within (partitioned by) the artist.

 SELECT
    track_id
    , artist_id
    , row_number() OVER( PARTITION BY artist_id ORDER BY num_plays DESC) as artist_rank
    , row_number() OVER( ORDER BY num_plays DESC) as overal_rank
  FROM (
    SELECT 
      playlists.tracks.data.id AS track_id,
      playlists.tracks.data.artist.id AS artist_id,
      COUNT(*) as num_plays
    FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
      ,tracks.data)) AS playlists
    GROUP EACH BY 1,2
  )

We build this into a derived table and add a couple of dimensions (see the full code):

  - dimension: rank_within_artist
    view_label: Track
    type: int
    sql: ${TABLE}.artist_rank

  - dimension: overal_rank
    view_label: Track
    type: int
    sql: ${TABLE}.overal_rank

 Top 10 Songs

With these new rankings we can now see the top 10 songs in our data set.

Explore From Here

Next, look at the ranking of the songs for each artist. We’d like more popular songs to have lower numbers. We’ve already computed rank_within_artist, let’s look at Frank Sinatra’s and Joan Baez’s top three songs. We notice that there is a data problem – there are two ids in the data for “Frank Sinatra” – but we’re just going to ignore the problem.

Change the filter to see a different artist’s top songs.

Explore From Here

 Step 3: Find Artists that Appear Together.

We’re now ready to build the core of our recommendation engine. SQL’s cross join (cross product) will allow us to build a mapping table that will ultimately look like this:

artist_id artist_name artist_id2 artist_name2 num_playlists

To get here, we need to build an intermediate table, playlist_artist. There is a record for every artist that appears on a playlist. This intermediate table will look like this:

artist_id artist_name playlist_id

Here’s the way we write this in LookML:

- explore: playlist_artist  # Just so we can test that it works
  hidden: true

- view: playlist_artist
  derived_table:
    sql: |
      SELECT 
        playlists.tracks.data.artist.id AS artist_id,
        playlists.tracks.data.artist.name AS artist_name,
        playlists.id AS playlist_id
      FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
        ,tracks.data)) AS playlists
      WHERE playlists.tracks.data.artist.id IS NOT NULL
      GROUP BY 1,2,3
  fields:
  - dimension: artist_id
  - dimension: artist_name
  - dimension: playlist_id

Next we join playlist_artist with itself to find pairings of artists on playlists and count the number of times the parings occur. For each pair of artists, we then create a closeness ranking, by again, ranking one artist, with another based on the number of times playlists include both artists. We’ll ultimately have a table that looks like this:

a.artist_id a.artist_name a.playlist_id = b.playlist_id b.artist_id b.artist_name count(*)

And here’s how we get there in LookML:

- explore: artist_artist      
- view: artist_artist
  extends: artist
  derived_table:
    sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists]
    sql: |
      SELECT 
        *,
        row_number() OVER (partition by artist2_id order by num_playlists DESC) as closeness_rank
      FROM (
        SELECT
          a.artist_id as artist_id,
          a.artist_name as artist_name,
          b.artist_id as artist2_id,
          b.artist_name as artist2_name,
          COUNT(*) as num_playlists
        FROM ${playlist_artist.SQL_TABLE_NAME} AS a
        JOIN EACH ${playlist_artist.SQL_TABLE_NAME} as b 
          ON a.playlist_id = b.playlist_id
        WHERE a.artist_id <> b.artist_id
        GROUP EACH BY 1,2,3,4
      )

  fields:
  - dimension: artist_id    # Inherited from 'view: artist'
  - dimension: artist_name

  - dimension: artist2_id
  - dimension: artist2_name

  - dimension: num_playlists
    type: int
  - dimension: closeness_rank
    type: int

  - measure: total_playlists
    type: sum
    sql: ${num_playlists}

  - measure: count
    type: count
    drill_fields: [artist_id, artist_name, artist2_id, artist2_name, num_playlists]

Now, for any given artist we can find the most closely related artists. Put another artist into the filter to find the other artists most closely related to them.

Explore From Here

 Step 5: Mission Accomplished

For any artist, we now know their most popular songs and which artists are most closely related to them.

To recommend a playlist, we simply find the most closely related 10 artists and include each artist’s top 3 tracks.

 The Code

This is the complete code to the data model.

 playlist.model.lookml

- connection: bigquery_publicdata

- include: "*.view.lookml"

- explore: playlists
  hidden: true
  joins:
  - join: playlist_facts
    sql_on: ${playlists.playlist_id} = ${playlist_facts.playlist_id}
    relationship: one_to_one
    view_label: Playlists

  - join: track_rank
    sql_on: ${playlists.track_id} = ${track_rank.track_id}
    relationship: one_to_one
    type: left_outer_each
    view_label: Track
    fields: [track_id, overall_rank, rank_within_artist]

- explore: recommender
  view: artist_artist
  always_filter:
    track_rank.rank_within_artist: <= 3
  joins:
  - join: track_rank
    sql_on: ${artist_artist.artist_id} = ${track_rank.artist_id}
    relationship: one_to_many
    type: left_outer_each

 playlists.view.lookml

# Basic playlist view 

- view: playlists
  extends: [artist,track]
  sql_table_name: |
      [bigquery-samples:playlists.playlists]

  fields:
  - measure: count
    type: count_distinct
    sql: ${playlist_id}
    drill_fields: [playlist_id]

  - dimension: rating
    type: int
    sql: ${TABLE}.rating

  - dimension: playlist_id
    type: int
    sql: ${TABLE}.id

  - dimension: artist_id
    view_label: Artist ID
    type: int
    sql: ${TABLE}.tracks.data.artist.id
    fanout_on: tracks.data

  - dimension: artist_name
    view_label: Artist
    type: string
    sql: ${TABLE}.tracks.data.artist.name
    fanout_on: tracks.data

  - measure: artist_count
    type: count_distinct
    sql: ${artist_id}
    drill_fields: [artist_id, artist_name, count, 
      track_count, track_instance_count, album_count]

  - dimension: album_id
    view_label: Album
    type: int
    sql: ${TABLE}.tracks.data.album.id
    fanout_on: tracks.data

  - dimension: album_title
    view_label: Album
    type: string
    sql: ${TABLE}.tracks.data.album.title
    fanout_on: tracks.data
    links:
    - label: iTunes
      url: http://www.google.com/search?q=itunes.com+{{artist_name._value}}+{{value}}&btnI

  - measure: album_count
    type: count_distinct
    sql: ${album_id}
    drill_fields: [album_id, album_title, count, track_count, artist_count]

  - dimension: track_title
    view_label: Track
    type: string
    sql: ${TABLE}.tracks.data.title
    fanout_on: tracks.data

  - dimension: track_id
    view_label: Track
    type: int
    sql: ${TABLE}.tracks.data.id
    fanout_on: tracks.data

  - measure: track_count
    type: count_distinct
    sql: ${track_id}
    drill_fields: [track_id, track_title, count]

  - measure: track_instance_count
    type: count_distinct
    sql: CONCAT(CAST(${track_id} AS STRING),CAST(${playlist_id} AS STRING))
    drill_fields: detail

  sets:
    detail:
    - playlist_id
    - artist_name
    - album_title
    - track_title

 playlist_facts.view.lookml

# Facts about playlists, number of different artists and number of tracks on each playlist
#  Used to filter out crappy playlists.

- view: playlist_facts
  derived_table:
    sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists]
    sql: |
      SELECT
    id as playlist_id
        , COUNT(DISTINCT tracks.data.artist.id) as num_artists
        , COUNT(DISTINCT tracks.data.id) as num_tracks
  FROM FLATTEN([bigquery-samples:playlists.playlists],tracks.data)
      GROUP BY 1
      HAVING num_artists > 0
  fields:
  - dimension: playlist_id
    hidden: true
  - dimension: num_artists
    type: number
  - dimension: num_tracks
    type: number

 artist.view.lookml

# Base definition for artist
#  Declares external links

- view: artist
  fields:
  - dimension: artist_id
  - dimension: artist_name
    links:
    - label: YouTube
      url: http://www.google.com/search?q=site:youtube.com+{{value}}&btnI
      icon_url: http://youtube.com/favicon.ico
    - label: Wikipedia
      url: http://www.google.com/search?q=site:wikipedia.com+{{value}}&btnI
      icon_url: https://en.wikipedia.org/static/favicon/wikipedia.ico
    - label: Twitter
      url: http://www.google.com/search?q=site:twitter.com+{{value}}&btnI
      icon_url: https://abs.twimg.com/favicons/favicon.ico
    - label: Facebook
      url: http://www.google.com/search?q=site:facebook.com+{{value}}&btnI
      icon_url: https://static.xx.fbcdn.net/rsrc.php/yl/r/H3nktOa7ZMg.ico

 artist_suggest.view.lookml

# Simplifed view of the top 5000 artists so we can make resonable suggestions for artists.

- view: artist_suggest
  derived_table:
    sql_trigger_value: SELECT COUNT(*) FROM ${playlist_artist.SQL_TABLE_NAME}
    sql: |
      SELECT 
        artist_name
        , COUNT(*)
      FROM ${playlist_artist.SQL_TABLE_NAME}
      GROUP BY 1
      ORDER BY 2 DESC
      LIMIT 5000

  fields:
  - dimension: artist_name

 playlist_artist.view.lookml

- explore: playlist_artist  # for debugging.
  hidden: true

#  Simple table of playlists artist appears on.  One row for every artist/playlist combination   

- view: playlist_artist
  derived_table:
    sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists]
    sql: |
        SELECT 
          playlists.tracks.data.artist.id AS artist_id,
          playlists.tracks.data.artist.name AS artist_name,
          playlists.id AS playlist_id
        FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
          ,tracks.data)) AS playlists
        JOIN ${playlist_facts.SQL_TABLE_NAME} AS playlist_facts 
          ON playlists.id = playlist_facts.playlist_id
        WHERE playlists.tracks.data.artist.id IS NOT NULL
          AND playlist_facts.num_artists < 10  
        GROUP EACH BY 1,2,3
  fields:
  - dimension: artist_id
  - dimension: artist_name
  - dimension: playlist_id

 artist_artist.view.lookml

# The core of the recommendation engine.  Cross joins playlist_artist to build a list of 
#  related artists.

- view: artist_artist
  extends: artist
  derived_table:
    sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists]
    sql: |
      SELECT 
        *,
        row_number() OVER (partition by artist2_id order by num_playlists DESC) as closeness_rank
      FROM (
        SELECT
          a.artist_id as artist_id,
          a.artist_name as artist_name,
          b.artist_id as artist2_id,
          b.artist_name as artist2_name,
          COUNT(*) as num_playlists
        FROM ${playlist_artist.SQL_TABLE_NAME} AS a
        JOIN EACH ${playlist_artist.SQL_TABLE_NAME} as b 
          ON a.playlist_id = b.playlist_id
        WHERE a.artist_id <> b.artist_id
        GROUP EACH BY 1,2,3,4
      )

  fields:
  - dimension: artist_id    # Inherited from 'view: artist'
  - dimension: artist_name

  - dimension: artist2_id
  - dimension: artist2_name

  - dimension: num_playlists
    type: int

  - dimension: closeness_rank
    type: int

  - measure: total_playlists
    type: sum
    sql: ${num_playlists}

  - measure: count
    type: count
    drill_fields: [artist_id, artist_name, artist2_id, artist2_name, num_playlists]

 track_rank.view.lookml

# Rank tracks both overall and within a given artist.   

- view: track_rank
  extends: track
  derived_table:
    sql_trigger_value: SELECT COUNT(*) FROM [bigquery-samples:playlists.playlists]
    sql: |
      SELECT
        track_id
        , track_title 
        , artist_id
        , artist_name
        , row_number() OVER( PARTITION BY artist_id ORDER BY num_plays DESC) as artist_rank
        , row_number() OVER( ORDER BY num_plays DESC) as overall_rank
      FROM (
        SELECT 
          playlists.tracks.data.id AS track_id,
          playlists.tracks.data.title AS track_title,
          playlists.tracks.data.artist.id AS artist_id,
          playlists.tracks.data.artist.name AS artist_name,
          COUNT(*) as num_plays
        FROM (SELECT * FROM FLATTEN([bigquery-samples:playlists.playlists]
          ,tracks.data)) AS playlists
        WHERE playlists.tracks.data.artist.id IS NOT NULL
          AND playlists.tracks.data.title IS NOT NULL
        GROUP EACH BY 1,2,3,4
      )

  fields:
  - dimension: track_id
    primary_key: true
    hidden: true
    type: int
    sql: ${TABLE}.track_id

  - dimension: track_title
    sql: ${TABLE}.track_title

  - dimension: artist_id
    type: int
    sql: ${TABLE}.artist_id

  - dimension: artist_name
    type: int
    sql: ${TABLE}.artist_name


  - dimension: rank_within_artist
    type: int
    sql: ${TABLE}.artist_rank

  - dimension: overall_rank
    view_label: Track
    type: int
    sql: ${TABLE}.overall_rank

  sets:
    detail:
      - track_id
      - artist_id
      - artist_rank
      - overall_rank

 
9
Kudos
 
9
Kudos

Now read this

Apple, Obstruction of Correspondence is a Crime

I’ve been using an iPhone for years. Once upon a time I admired Apple, but they seem to have changed. Apple’s business strategy seems to be make it incredibly painful for people to switch to Android by quietly obstructing message... Continue →