Working with the Tickit Data

This topic provides information about loading the Tickit demo data and running the example queries. You can also load the data and run queries using the interactive AnzoGraph Tutorial notebook as described in Zeppelin Notebook Integration or you can copy the queries in this topic and run them in the AnzoGraph Query Console (if available with your solution), command line interface, or another interface.

Loading the Tickit Data

The following query loads the files in the tickit.ttl.gz directory in the Cambridge Semantics csi-sdl-data-tickit bucket on Amazon S3.

LOAD WITH 'global' <s3://csi-notebook-datasets/MovieTicketAnalysis/20190217/tickit.ttl.gz>
INTO GRAPH <tickit>

For more information about loading data from files, see Loading Data from Files.

When the load completes, you can run this query to return the total number of triples in the data set:

SELECT (count(*) as ?number_of_triples)
FROM <tickit>
WHERE { ?s ?p ?o }

Getting to Know the Tickit Data

The Tickit data set captures sales activity for the fictional Tickit website where people buy and sell tickets for sporting events, shows, and concerts. The data consists of person, venue, category, date, event, listing, and sales information. By identifying ticket movement over time, success rates for sellers, the best-selling events and venues, and the most profitable times of the year, analysts can use this data to determine what incentives to offer, how to attract new people, and how to drive advertising and promotions.

To help familiarize you with the data set, the following diagram shows the model or ontology for the tickit graph. Circles represent subjects or classes of data and rectangles represent properties.

To help familiarize you with the triples in the tickit graph, the diagram below shows an instance of a subset of the triples in the graph.

Running the Tickit Queries

This section describes each of the Tickit queries and includes detailed explanations of the query syntax. Queries are grouped into categories such as "The Basics," which covers introductory SPARQL concepts, and "Marketing," "Social Graph," and "Fraud," which demonstrate analytic use cases.

The Basics

The basic queries provide guidance for users who are new to the SPARQL query language. These queries demonstrate introductory SPARQL concepts such as fetching and filtering, traversing graphs by joining data, using aggregate functions, and writing subqueries.

Fetching and Filtering

The query below fetches the predicates and objects for a specific person.

SELECT ?predicate ?object
FROM <tickit>
WHERE {
<person49158> ?predicate ?object .
}
ORDER BY ?predicate
  • The SELECT list asks for all of the predicates and objects: SELECT ?predicate ?object.
  • The WHERE clause narrows the results to return just the predicates and objects that are related to person49158: WHERE { <person49158> ?predicate ?object }.
  • The ORDER BY clause orders the results by predicate name. By default ORDER BY lists results in ascending order. Since the results for ?predicate are string values, the results are in alphabetical order. To reverse the results to descending order, you can change ORDER BY ?predicate to ORDER BY DESC(?predicate).

Graph Traversal

The query below reports where and when events take place by traversing the tickit graph and creating joins between different classes in the tickit model.

SELECT ?eventname ?location ?date
FROM <tickit>
WHERE {
  ?eventid <venueid> ?venueid .
  ?venueid <venuename> ?location .
  ?eventid <dateid> ?dateid .
  ?dateid <caldate> ?date .
  ?eventid <eventname> ?eventname .
}
ORDER BY ?date ?eventname ?location
LIMIT 100

Since the location information for events exists in the venue class and the date data for events exists in the date class, the event, venue, and date data are joined to report on the location and date for the events.

For example, the following two triples join event ID to location using the venue ID for each event:

?eventid <venueid> ?venueid .
?venueid <venuename> ?location .

And these triples join the event ID to the calendar date using the date ID:

?eventid <dateid> ?dateid .
?dateid <caldate> ?date .

Aggregation

The query below uses the COUNT SPARQL aggregate function to count the number of times each event occurs.

SELECT ?event_name (count(*) as ?count)
FROM <tickit>
WHERE {
  ?event <eventname> ?event_name
}
GROUP BY ?event_name
ORDER BY desc(?count) ?event_name
LIMIT 10
  • The aggregate function in the SELECT list ((count(*) as ?count)) counts the event names (?event_name) produced by the WHERE clause.
  • Since the query includes an aggregate function, a GROUP BY statement (GROUP BY ?event_name) is required to specify any variables in the SELECT list that are not aggregated.
  • By using a LIMIT clause (LIMIT 10), the query reports only the 10 events that occurred most frequently.

Subqueries

The query below uses a subquery to find the total number of tickets sold, price paid, and commission paid for each event and then determine profit for each event.

SELECT ?event ?tickets ((?total_paid - ?commission_paid) as ?profit)
FROM <tickit>
WHERE {
  SELECT ?event (sum(?qty) as ?tickets) (sum(?comm) as ?commission_paid) (sum(?price) as ?total_paid)
  WHERE {
    ?sales <qtysold> ?qty .
    ?sales <eventid> ?eventid .
    ?eventid <eventname> ?event .
    ?sales <commission> ?comm .
    ?sales <pricepaid> ?price .
  }
  GROUP BY ?event
}
ORDER BY desc(?profit)
LIMIT 100
  • The subquery is processed first. The WHERE clause in the subquery joins sales and event data to return ticket, commission, and price paid information for each event.
  • The SELECT list for the subquery then calculates the sums of the total tickets, commission, and price paid for each event.
  • Because the subquery uses aggregate functions, it requires a GROUP BY statement to group on the non-aggregate variable (?event) in the SELECT list.
  • The top-level SELECT list for the query uses the subquery results to subtract the commission paid from the total price paid to calculate the profit for each event.

Marketing

The marketing queries provide analytics that answer questions a user might ask when making event marketing decisions.

Most Popular States

The query below reports on the most popular state to host events based on the number of venues per state.

SELECT ?state (count(?venue) as ?total)
FROM <tickit>
WHERE {
  ?venue <venuestate> ?state .
}
GROUP BY ?state
ORDER BY desc(?total) ?state
  • The aggregate function in the SELECT list ((count(?venue) as ?count)) counts the venues (?venue) produced by the WHERE clause.
  • Since the query includes an aggregate function, a GROUP BY statement (GROUP BY ?state) is required to group the non-aggregate variable in the SELECT list (?state).
  • By ordering the results by total venues in decending order (ORDER BY desc(?total) ?state), the most popular state becomes the first state in the results.

Least Popular Events

The query below determines the most unpopular events by returning the 10 events with the least number of ticket sales. It also returns the event category.

SELECT ?event ?category (sum(?qty) as ?total_tickets)
FROM <tickit>
WHERE {
  ?sales <qtysold> ?qty .
  ?sales <eventid> ?eventid .
  ?eventid <eventname> ?event .
  ?eventid <catid> ?catid .
  ?catid <catname> ?category .
}
GROUP BY ?event ?category
ORDER BY ?total_tickets
LIMIT 10

Like the "Most Popular States" query, this examples uses an aggregate function ((sum(?qty) as ?total_tickets)) to calculate the total tickets for each event.

In the WHERE clause, the following triples join the sales data and event name using the event ID:

?sales <eventid> ?eventid .
?eventid <eventname> ?event .

And these triples join the event with the category name on category ID:

?eventid <catid> ?catid .
?catid <catname> ?category .

Since the query uses the SUM aggregate function, the query includes a GROUP BY clause to group on the non-aggregate variables, ?event and ?category.

Inventory Aging

The query below reports on the 20 events for which tickets took the longest to sell.

SELECT ?location ?kind ?name ?list_date (((?selldate - ?list_date)) as ?sale_age)
FROM <tickit>
WHERE {
  ?sale <saletime> ?selldate .
  ?sale <eventid> ?event .
  ?listing <eventid> ?event .
  ?listing <listtime> ?list_date .
  ?event <eventname> ?name .
  ?event <venueid> ?venue .
  ?event <catid> ?cat .
  ?cat <catname> ?kind .
  ?venue <venuename> ?location .
}
ORDER BY desc(?selldate) desc(?list_date) ?location ?kind ?name
LIMIT 20
  • In the WHERE clause, the query traverses the sales, listing, event, category, and venue classes.
  • In the SELECT list, the query calculates the duration of the sale time by subtracting the listing date from the sale date: ((?selldate - ?list_date) as ?sale_age).

Social Graph

The social graph queries focus on finding connections between the people (buyers and sellers) in the tickit graph.

Event Partners

The query below finds 20 pairs of friends who went to the same event most often.

SELECT (count(*) as ?gone_together) ?name1 ?name2
FROM <tickit>
WHERE {
  ?sale1 <buyerid> ?person1 .
  ?sale1 <eventid> ?event .
  ?person1 <friend> ?person2 .
  ?person1 <firstname> ?first1 .
  ?person1 <lastname> ?last1 .
  BIND(CONCAT(?first1, " ", ?last1) AS ?name1)
  ?sale2 <buyerid> ?person2 .
  ?sale2 <eventid> ?event .
  ?person2 <firstname> ?first2 .
  ?person2 <lastname> ?last2 .
  BIND(CONCAT(?first2, " ", ?last2) AS ?name2)
  FILTER ( ?name2 > ?name1 )
}
GROUP BY ?name1 ?name2
ORDER BY desc(?gone_together) ?name1 ?name2
LIMIT 20
  • In the WHERE clause, the query traverses the sales, event, and person classes. The following triples find the events for which person1 bought tickets:
    ?sale1 <buyerid> ?person1 .
    ?sale1 <eventid> ?event .
  • The ?person1 <friend> ?person2 triple narrows the results to people who are friends of person2.
  • To simplify the query's resulting columns, the CONCAT function is used to concatenate the first and last name of person1: BIND(CONCAT(?first1, " ", ?last1) AS ?name1). It also adds a space between the first and last name. And the BIND function binds the concatenated name to the variable name1.
  • Similar to the first group of triples for person1, the triples for person2 find the events for which person2 bought tickets.
  • The FILTER (FILTER ( ?name2 > ?name1 )) is also used to simplify or clean the results to do an ASCII comparison of name1 and name2 to omit duplicate pairs such as these:
    "Joe Smith" <friend> "Bob Jones"
    "Bob Jones" <friend> "Joe Smith"

    Since the filter states that name2 is greater than name1, the results will list only one version of the triple, the one where name2 comes later in the alphabet than name1.

  • Finally, the SELECT list uses the COUNT aggregate function to count the number of times person1 and person2 bought tickets to the same event. Use of the aggregate function requires the GROUP BY clause to group on name1 and name2.

Potential Event Partners

The query below reports on people who might attend musicals together. This query uses the same concepts as the previous query, "Event Partners." The query also traverses the sales, event, and person classes, but finds pairs of friends who know someone who bought tickets to one or more musicals.

SELECT (count(*) as ?know_buyer) ?name1 ?name2
FROM <tickit>
WHERE {
  ?sale1 <buyerid> ?buyer .
  ?sale1 <eventid> ?event .
  ?buyer <friend> ?person2 .
  ?buyer <friend> ?person1 .
  ?person1 <firstname> ?first1 .
  ?person1 <lastname> ?last1 .
  BIND(CONCAT(?first1, " ", ?last1) AS ?name1)
  ?person2 <firstname> ?first2 .
  ?person2 <lastname> ?last2 .
  BIND(CONCAT(?first2, " ", ?last2) AS ?name2)
  ?event <catid> ?cat .
  ?cat <catname> "Musicals" .
  FILTER ( ?name2 > ?name1 )
}
GROUP BY ?name1 ?name2
ORDER BY desc(?know_buyer) ?name1 ?name2
LIMIT 20
  • In the WHERE clause, the first 8 triples find the person (?buyer) who bought tickets to events and the names of two people who are friends of the buyer.
  • The last 2 triples narrow the results to list only the events in the category "Musicals":
    ?event <catid> ?cat .
    ?cat <catname> "Musicals" .
  • Like the Event Partners query, the FILTER eliminates duplicate but reversed triples from the results.
  • In the SELECT list, the COUNT function counts the number of times person1 and person2 knew the person who bought tickets to an event.

Fraud

The fraud queries focus on finding problematic sales or people in the tickit graph, such as identity thieves and ticket scalpers.

Possible Ticket Scalpers

The query below identifies possible ticket scalpers by calculating the average price per ticket for events and then finding cases where tickets are listed for a higher price.

SELECT ?sellername ?avg_price ?priceperticket ?eventname ?listtime
FROM <tickit>
WHERE {
  { SELECT ?eventname (avg(?priceperticket) as ?avg_price)
  WHERE {
    ?listing <eventid> ?eventid .
    ?eventid <eventname> ?eventname .
    ?listing <priceperticket> ?priceperticket .
    }
    GROUP BY ?eventname
  }
  ?listing <listtime> ?listtime .
  ?listing <priceperticket> ?priceperticket .
  ?listing <sellerid> ?seller .
  ?seller <firstname> ?firstname .
  ?seller <lastname> ?lastname .
  BIND(CONCAT(?firstname, " ", ?lastname) AS ?sellername)
  FILTER (?priceperticket > ?avg_price)
}
ORDER BY desc(?priceperticket) ?sellername ?eventname
LIMIT 1000
  • The WHERE clause includes a subquery to calculate the average price of tickets for each event listing. Since subqueries are processed first, that calculation (?avg_price) becomes available to use for comparing with all of the prices listed by sellers.
  • Below the subquery, the other triples in the WHERE clause return the price per ticket and seller name for each listing. The FILTER (FILTER (?priceperticket > ?avg_price)) narrows the results to return just the listings where the price per ticket is greater than the average price for that listing.

People with the Same SSN

The query below reveals potential identity thieves by reporting on people who have the same social security number but different names.

SELECT ?first1 ?last1 ?sameSSN ?first2 ?last2
FROM <tickit>
WHERE {
  ?person1 <ssn> ?sameSSN .
  ?person1 <firstname> ?first1 .
  ?person1 <lastname> ?last1 .
  ?person2 <ssn> ?sameSSN .
  ?person2 <firstname> ?first2 .
  ?person2 <lastname> ?last2 .
  FILTER ( str(?person1) > str(?person2) )
}
ORDER by ?sameSSN
LIMIT 100
  • The WHERE clause includes triples to compare the first name, last name, and social security number for the people in the person class. Using the same variable, sameSSN, for person1 and person2 limits the results to people who have the same SSN.
  • The FILTER, like other queries, eliminates duplicate rows in the results. Since this filter performs the comparison on person1 and person2, which are subjects (URIs) in the tickit graph, the person URIs are converted to strings using the STR function.

Tickets Sold by Possible Identity Thieves

The query below builds on the previous query, "People with the Same SSN," by reporting on events where the seller who sold tickets is one of the people who has the same SSN as someone else.

SELECT ?firstname ?lastname ?eventname ?location ?pricepaid ?date
FROM <tickit>
WHERE {
# get all event info
  ?eventid <venueid> ?venueid .
  ?venueid <venuename> ?location .
  ?eventid <dateid> ?dateid .
  ?dateid <caldate> ?date .
  ?eventid <eventname> ?eventname .
# note the sellers
  ?sales <eventid> ?eventid .
  ?sales <pricepaid> ?pricepaid .
  ?sales <sellerid> ?thief .
# limit to thieves
  ?person <ssn> ?sameSSN .
  ?thief <ssn> ?sameSSN .
  ?thief <firstname> ?firstname .
  ?thief <lastname> ?lastname .
  FILTER ( ?person != ?thief )
}
ORDER BY ?date ?eventname ?location ?firstname ?lastname ?pricepaid
LIMIT 50
  • In the WHERE clause, the first group of triples traverses the venue, event, and date data to find the location and date for each event.
  • The second group of triples joins in the sales data to find the people who sold the tickets to the events.
  • The third group of triples compares people's social security numbers and narrows the list of sellers to return only the people who have the same SSN.

Tickets Sold by Friends of Possible Identity Thieves

The query below takes the "Tickets Sold by Possible Identity Thieves" query one step further to report on whether any friends of a possible identity thief sold tickets to events.

SELECT ?thief_name ?friend_name ?eventname ?location ?pricepaid ?date
FROM <tickit>
WHERE {
# get all ticket sales info
  ?eventid <venueid> ?venueid .
  ?venueid <venuename> ?location .
  ?eventid <dateid> ?dateid .
  ?dateid <caldate> ?date .
  ?eventid <eventname> ?eventname .
# limit to thieves
  ?person <ssn> ?sameSSN .
  ?thief <ssn> ?sameSSN .
  ?thief <firstname> ?thief_first .
  ?thief <lastname> ?thief_last .
  BIND(CONCAT(?thief_first, " ", ?thief_last) AS ?thief_name)
# note thieves friends
  ?friend <friend> ?thief .
  ?friend <firstname> ?friend_first.
  ?friend <lastname> ?friend_last .
  BIND(CONCAT(?friend_first, " ", ?friend_last) AS ?friend_name)
# note the sellers
  ?sales <eventid> ?eventid .
  ?sales <pricepaid> ?pricepaid .
  ?sales <sellerid> ?friend .
  FILTER ( ?person != ?thief )
}
ORDER BY ?date ?eventname ?location ?thief_name ?friend_name ?pricepaid
LIMIT 500
  • In the WHERE clause, the first group of triples traverses the venue, event, and date data to find the location and date for each event.
  • The second group of triples compares people's social security numbers to find the names of people who have the same SSN. To simplify the query's resulting columns, the CONCAT function is used to concatenate the first and last name of "thief": BIND(CONCAT(?thief_first, " ", ?thief_last) AS ?thief_name). It also adds a space between the first and last name. And the BIND function binds the concatenated name to the variable thief_name.
  • The third group of triples finds a list of the friends of people who have the same SSN.
  • And the fourth group of triples finds any sales where the seller is one of the friends found by the third group of triples.

Finance

The finance queries focus on analyzing the financial data in the tickit graph.

Big Spenders

The query below finds the 100 people who spent the most on tickets for events.

SELECT ?first ?last (sum(?dollars) as ?spent)
FROM <tickit>
WHERE {
  ?person <firstname> ?first .
  ?person <lastname> ?last .
  ?sale <buyerid> ?person .
  ?sale <pricepaid> ?dollars
}
GROUP BY ?first ?last
ORDER BY desc(?spent) ?first ?last
LIMIT 100
  • The triples in the WHERE clause traverse the person and sales classes, joining on the buyer ID, to list all of the people who bought tickets and the amount they spent per sale.
  • The aggregate function in the SELECT list ((sum(?dollars) as ?spent)) calculates the sum of all the dollars spent by each person.
  • Since the query uses the SUM aggregate function, the query includes a GROUP BY clause to group on the non-aggregate variables, ?first and ?last.
  • To sort the results to list the biggest spenders first, the ORDER BY statement (ORDER BY desc(?spent) ?first ?last) lists the amount spent in descending order.

Top Sales People

The query below finds the 100 people who made the most money selling tickets.

SELECT ?first ?last ?category (sum(?dollars) as ?earned)
FROM <tickit>
WHERE {
  ?person <firstname> ?first .
  ?person <lastname> ?last .
  ?sale <sellerid> ?person .
  ?sale <pricepaid> ?dollars .
  ?sale <eventid> ?event .
  ?event <catid> ?cat .
  ?cat <catname> ?category
}
GROUP BY ?first ?last ?category
ORDER BY desc(?earned) ?first ?last ?category
LIMIT 100
  • The triples in the WHERE clause traverse the person, event, category, and sales classes to list all of the people who sold tickets, the amount they earned per sale, and the category of the events.
  • The aggregate function in the SELECT list ((sum(?dollars) as ?earned)) calculates the sum of all the dollars earned by each person.
  • Since the query uses the SUM aggregate function, the query includes a GROUP BY clause to group on the non-aggregate variables, ?first, ?last, and ?category.
  • To sort the results to list the biggest earners first, the ORDER BY statement (ORDER BY desc(?earned) ?first ?last ?category) lists the amount spent in descending order.
Related Topics