This article shows you how to solve the “Six Degrees of Kevin Bacon” game using a mixture of SPARQL and Python. Meanwhile, if you want to relax from problems like this, games such as w88 for mobile can be played.
SPARQL
SPARQL is a query language for triple stores that was born out of the Semantic Web effort. A triple is a simple 3 part statement or ‘fact’ such as “Australia is Country”. This consists of two objects (‘Australia’ and ‘Country’) and the predicate (also known as a property), ‘is’.
There are a number of triple stores online. We will be using DBpedia which stores triples extracted from the Wikipedia pages. we will assume the actor listings are complete, although this is unlikely.
SPARQL’s syntax is inspired by SQL and consists of a list of variables to return, a WHERE clause, and some other optional clauses such as ‘LIMIT’. The WHERE clause consists of one or more triples to match. Objects or properties can be replaced with variables indicated by a ‘?’ character. These are the values to return.
For example, if querying the DBpedia database, the following SPARQL query searches for all films that starred Kevin Bacon:
SELECT ?film WHERE { ?film dbo:starring :Kevin_Bacon . }
Try it with the DBpedia SPARQL endpoint at: dbpedia.org/snorql/ . The ‘dbo:’ is a namespace abbreviation. SPARQL finds all triples in its database that have the specified predicate and second object, and returns all of the ?film objects from these triples.
We can extend the above query to match multiple triples. This works like an ‘AND’. For example, the following queries all films that Kevin Bacon starred in, and all actors he starred with on those films:
SELECT DISTINCT ?film, ?actor WHERE { ?film dbo:starring :Kevin_Bacon . ?film dbo:starring ?actor . }
This can be extended with multiple steps (‘hops’) to additional films:
SELECT DISTINCT ?film, ?actor, ?film2, ?actor2 WHERE { ?film dbo:starring :Kevin_Bacon . ?film dbo:starring ?actor . FILTER (?actor != :Kevin_Bacon) ?film2 dbo:starring ?actor . FILTER (?film2 != ?film ) ?film2 dbo:starring ?actor2 . FILTER (?actor2 != ?actor ) FILTER (?actor2 != :Kevin_Bacon) }
FILTER is a potentially powerful statement and can be used to filter the matches within specific constraints. Here we use them to avoid “loops” where a path goes through the same actor and/or film multiple times. However it does not guarantee the shortest routes. For example, the above query gives a number of one-hop results (e.g. via “R.I.P.D.”) for Jeff Bridges but then also gives the two-hop result of “The Air I Breathe” / Forest Whitaker / “Blown Away”.
Another problem is that both the query text and the search space explode very quickly. The above query can be expanded to include six films, and to only return the path from the actor of interest to Kevin Bacon, but this will probably lead to a timeout on the public server, and it would not report shorter paths. Ideally we want something that will return the shortest path.
Enter Python…
Although DBpedia’s SPARQL implementation does include some graph traversing extensions, these are difficult to use and non-standard. Our solution is to use Python to script the SPARQL queries. This allows us to add additional logic, limiting the queries that are performed.
The SPARQLWrapper library is used to make SPARQL calls, and can be installed with pip. Responses can be in XML or JSON. We shall use parsed JSON for convenience. Here is an example SPARQL call:
from SPARQLWrapper import SPARQLWrapper, JSON sparql = SPARQLWrapper("http://dbpedia.org/sparql") sparql.setQuery(""" SELECT DISTINCT ?film, ?actor WHERE { ?film dbo:starring <http://dbpedia.org/resource/Kevin_Bacon> . ?film dbo:starring ?actor . } """ ) sparql.setReturnFormat(JSON) results = sparql.query().convert() print(results)
This queries DBPedia for all films starring Kevin Bacon, and all actors starring in those films. The parsed JSON result is displayed to the terminal.
From this, we can build up a new implementation. First some definitions, and then a new function query all films starring an actor, and all actors starring in those films:
import csv import random import re import string import sys import time from SPARQLWrapper import SPARQLWrapper, JSON # These list Films and Actors we've already visited (ie. have a route to) FilmsVisited = set() ActorsVisited = set() # Queries all films starring this actor (preV_actor) # Also returns all actors starring in these movies # The return value is a list of tuples (film, actor) # These and prev_actor are DBPedia labels (eg. "Kevin_Bacon) without # a namespace # sparql is a SPARQLWrapper object pointing to DBPedia def QueryFilmActors(sparql, prev_actor): time.sleep(0.1) sparql.setQuery(""" SELECT DISTINCT ?film, ?actor WHERE { ?film dbo:starring <http://dbpedia.org/resource/%s> . ?film dbo:starring ?actor . } """ % (prev_actor, ) ) sparql.setReturnFormat(JSON) results = sparql.query().convert() result_tuples = [] for result in results["results"]["bindings"]: film = result["film"]["value"] actor = result["actor"]["value"] # remove namespace film = film.split("/").pop() actor = actor.split("/").pop() result_tuples.append( (film,actor) ) # return the result (might be an empty list) return result_tuples
Notice the call to time.sleep(). This short delay is to throttle our queries. DBpedia is a public service and we should not flood it with a sudden influx of queries. You could remove this if you were working against your own local mirror of the DBpedia database.
To perform the search we use a very basic version of the Djikstra algorithm. This is used to find the optimum route through a graph network. The classic Djikstra algorithm has ‘weights’ (e.g. travel times) for the links and uses a priority queue to prioritize the next shortest path to work on. Here, all links (‘hops’) are of equal weight. Therefore we simply search through all 1-hop routes, then 2-hop routes, etc.
Our next method expands one path out to all films and actors that we have not yet visited:
# Find all new actors who have acted with this actor # Filters our films and actors that we have already visited on this # or other paths # # Returns a tuple of ( list of the new paths, list of dest paths) # the "dest paths" are new path(s) which reach the Kevin_Bacon destination # path: the current path leading up to this actor # actor: actor to query def FindActorsForPath( sparql, path, actor): global FilmsVisited, ActorsVisited new_path_list = [ ] reached_destination = [ ] candidates = QueryFilmActors(sparql, actor) for (new_film, new_actor) in candidates: if not new_film in FilmsVisited: if not new_actor in ActorsVisited: ActorsVisited.add(new_actor) new_path = list(path) new_path.append( (new_film, new_actor ) ) new_path_list.append(new_path) if new_actor == "Kevin_Bacon": reached_destination.append(new_path) # 2nd pass because we do not want to prematurely exclude films for (new_film, new_actor) in candidates: if not new_film in FilmsVisited: FilmsVisited.add(new_film) return (new_path_list, reached_destination)
Finally, we have the main loop:
# The main (command line) call if __name__ == "__main__": # Get the entity parameter # For simplicity, error checking has been omitted actor = sys.argv[1] # Create our SPARQLWrapper object sparql = SPARQLWrapper("http://dbpedia.org/sparql") # First iteration is with one call ActorsVisited.add(actor) all_paths = [ [] ] reached_dest = [] # Loop up to 6 levels in the search for step in range(0,6): # New step level next_new_paths = [ ] # Loop through all of our existing paths (all_paths), trying to # extend them. New paths are written to next_new_paths for this_path in all_paths: # Get the actor at the end of this path if len(this_path) == 0: this_actor = actor else: # last actor of this path this_actor = this_path[-1][1] (new_paths, reached_dest) = FindActorsForPath(sparql, this_path, this_actor) next_new_paths += new_paths # Did we find Kevin Bacon? if len(reached_dest) > 0: # We've found Kevin Bacon! break if len(reached_dest) > 0: break # save the new paths ready for the next iteration all_paths = next_new_paths # loop around to the next step or hop # do we have a result? If so, display it! if len(reached_dest) > 0: num_hops = len(reached_dest[0]) print("Kevin Bacon found in {} hops:".format(num_hops) ) for route in reached_dest: print("Route:") for (film,actor) in route: print(" Film: {}; Actor: {}".format(film,actor) ) else: print("Kevin Bacon was not found")
When using this script from the command line, specify the actor as the first parameter. The actor needs to be formatted as per the DBpedia label, e.g. “Gillian_Anderson” and without a namespace. The above script will add the namespace.
Results
Here are the results for John Mills and Charlie Chaplin:
$ python3 find_bacon.py John_Mills Kevin Bacon found in 2 hops: Route: Film: Zulu_Dawn; Actor: Bob_Hoskins Film: Balto_(film); Actor: Kevin_Bacon $ python3 find_bacon.py Charlie_Chaplin Kevin Bacon found in 3 hops: Route: Film: Brother,_Can_You_Spare_a_Dime%3F_(film); Actor: Bing_Crosby Film: Say_One_for_Me; Actor: Robert_Wagner Film: Wild_Things_(film); Actor: Kevin_Bacon
Further Extensions
You could just as easily search out from Kevin Bacon to the target actor. In fact, you could do this without a specific target and end up with a list of all actors who can be reached within six steps! The search space will explode for this many hops, so this is probably only practical if you are querying a local database.