Loading CSV Files

This topic provides instructions for loading data from CSV files.

The load method described in this topic is advanced and requires manual writing of complex queries as well as familiarity with the columns in the CSV file and their data types. You can also load data from CSV files using the Data Toolkit service. See Reading or Ingesting Instance Data for more information.

For instructions on loading TTL, NT, NQ, and TriG files, see Loading Triple and Quad Files. For information about load file directory requirements and load architecture, see Load File Requirements.

CSV Load Using INSERT Query

Load data from CSV files by writing an INSERT query that specifies the triples to create from the tabular data in the file. In the WHERE clause, include a TABLE statement that specifies the file or directory of files to load, the delimiter used in the CSV file(s), and the schema. Use the following syntax to write the INSERT query to load data from CSV files. The table below describes each of the options for the TABLE clause.

[ PREFIX ... ]
INSERT { graph_or_triple_patterns }
WHERE {
TABLE <URI>(materialType,fileSystem,delimiter,hasHeaders,columnMappingString)
  ...
}
TABLE Option Description
<URI> Provide the path to the location of the CSV file or directory of files. For example, the following URI loads one file (Abstract.csv) from the movie.csv directory on a mounted file system:
<file:/global/nfs/vpc_nfs_server/data/movie.csv/Abstract.csv>

If all CSV files in a directory use the same schema (i.e., have the same columns and data types), you can load the directory of files in one INSERT query. If the files have different columns, create an INSERT query for each file.

The following URI loads all files in the movie.csv directory. Since all files in movie.csv have the same schema, the directory can be specified as the URI, and all files in the directory will be loaded:

<dir:/global/nfs/vpc_nfs_server/data/movie.csv/> 

And this URI loads a file from S3:

<s3://mybucket/data/movie.csv/Abstract.csv>

When specifying file: or dir: to load files from a file system, the file system must be accessible from AnzoGraph. In a Docker environment, the file or directory must be stored on the AnzoGraph container file system. For instructions on copying files or directories from a local file system to the AnzoGraph file system in a Docker container, see Loading Files from the AnzoGraph File System in Docker. For more information on loading data into AnzoGraph from HDFS data sources, see Loading Files from HDFS.

materialType The Multipurpose Internet Mail Extensions (MIME) type of the files. Specify csv for uncompressed files or csvgz for gzipped CSV files.
fileSystem This option specifies which AnzoGraph servers can access and load the CSV files. Specify global if all servers have access to the files or if you are loading from S3. Specify leader if only the leader server has access. Or specify compute if each compute server has a subset of the files on its local filesystem.
delimiter Specify the character that is used as the delimiter in the CSV file, such as a pipe (|), comma (,), asterisk (*), etc.
hasHeaders Boolean value that specifies whether the file has headers. Specify true if headers exist or false if the file does not contain headers.
columnMappingString Defines the mapping for the file. The value lists each column name and data type in a comma-separated list. For example, for a CSV file with four columns (ID, Title, Category ID, and Category Name):
ID:long,Title:char,CategoryID:long,CategoryName:char

Examples

This section provides example INSERT queries and CSV files to use for learning to load tabular data to AnzoGraph.

Movie Actors Example

The example below uses the attached movie-actors CSV file, which includes data about movie titles and the actors who perform in the movies. The file includes four columns that list the Movie ID, Movie Title, Actor ID, and Actor Name. For example:

MovieID,MovieTitle,ActorID,ActorName
15400287,Mission: Impossible (film series),689567,Ving Rhames
15400287,Mission: Impossible (film series),31460,Tom Cruise
30271424,Alvin and the Chipmunks (film series),1582246,Kimberly Williams-Paisley
30271424,Alvin and the Chipmunks (film series),708892,Jason Lee (actor)
30271424,Alvin and the Chipmunks (film series),876916,Kaley Cuoco
...

Click the link below to download movie-actors.csv to your computer. Place the file on the local AnzoGraph file system or another storage location that AnzoGraph has access to.

Download the Sample Movie Actors File

The following example query loads the data from movie-actors.csv into a graph named actors. The list below the example query describes each of the components that make up the query.

Open the query below for copying

INSERT { GRAPH <actors> {
  ?MovieIRI a <Movie> ;
    <MovieTitle> ?MovieTitle .
  ?ActorIRI a <Actor> ;
    <ActorName> ?ActorName .
  ?ActorIRI <ActedIn> ?MovieIRI .
  }
}
WHERE { TABLE <file:/home/user/movie-actors.csv>
('csv','leader',',',true,'MovieID:long,MovieTitle:char,ActorID:long,ActorName:char')
BIND(IRI(CONCAT("Movie",str(?MovieID))) as ?MovieIRI)
BIND(IRI(CONCAT("Actor",str(?ActorID))) as ?ActorIRI)
}

In the example query:

  • The INSERT statement creates a graph named actors and inserts the instance data into that graph.
    INSERT { GRAPH <actors> {
  • The first triple pattern inserts an rdf:type statement to express that a MovieIRI is an instance of a Movie class, and it creates a ?MovieIRI <MovieTitle> ?MovieTitle triple to link movie titles to their MovieIRI:
    ?MovieIRI a <Movie> ;
        <MovieTitle> ?MovieTitle .

    Forming the MovieIRI value is defined later in the query and described below.

  • The second triple pattern follows the first triple pattern to define the actor triples the same way:
    ?ActorIRI a <Actor> ;
        <ActorName> ?ActorName .
  • The third triple pattern defines a relationship between the actors and the movies they act in. If the relationship was not defined, querying the data would return results that show that all of the actors star in every movie:
    ?ActorIRI <ActedIn> ?MovieIRI .
  • In the TABLE statement in the WHERE clause, the URI that defines the path to the load file shows that the file is in the user's home directory on the AnzoGraph server:
    TABLE <file:/home/user/movie-actors.csv>

    If you run the example query, make sure that you modify the load file URI to specify the location of movie-actors.csv for your setup.

  • In the parentheses after the URI, the material type is specified as 'csv'. Since this is a single node system, the file system is specified as 'leader', the delimiter used in the file is comma (','), the file includes headers (true), and it has four columns specified as the columnMappingString or schema:
    'csv','leader',',',true,'MovieID:long,MovieTitle:char,ActorID:long,ActorName:char'
  • The WHERE clause also includes the following BIND statements to define the MovieIRI and ActorIRI values using the MovieID and ActorID columns in the CSV file. The statements create an IRI from the concatenation of the text "Movie" or "Actor" and the Movie or Actor ID and bind them to the Movie and Actor IRI variables. For example, if a MovieID in the CSV file is "1234," concatenating that value with "Movie" produces a URI in the form of <Movie1234>. Constructing the URI in this manner avoids potential conflicts if there is also an ActorID value of "1234" in the CSV file.
    BIND(IRI(CONCAT("Movie",str(?MovieID))) as ?MovieIRI)
    BIND(IRI(CONCAT("Actor",str(?ActorID))) as ?ActorIRI)

After loading the data, you can run SPARQL queries to explore the actors graph. For example, this simple query returns the list of movies and the actors who star in them:

SELECT ?movie ?actor
FROM <actors>
WHERE {
  ?MovieIRI <MovieTitle> ?movie .
  ?ActorIRI <ActedIn> ?MovieIRI .
  ?ActorIRI <ActorName> ?actor .
}
ORDER BY ?movie
LIMIT 50
movie                         | actor
------------------------------+----------------------
$pent                         | Jason London
$pent                         | Phill Lewis
'Til There Was You            | Jennifer Aniston
'Til There Was You            | Dylan McDermott
'Til There Was You            | Sarah Jessica Parker
'Til There Was You            | Jeanne Tripplehorn
...And Justice for All (film) | Lee Strasberg
...And Justice for All (film) | John Forsythe
...And Justice for All (film) | Jack Warden
...And Justice for All (film) | Al Pacino
...tick...tick...tick...      | Jim Brown
10 (film)                     | Robert Webber
10 (film)                     | Dudley Moore
10 (film)                     | Julie Andrews
10 (film)                     | Bo Derek
10 Items or Less (film)       | Kumar Pallana
10 Items or Less (film)       | Paz Vega
10 Items or Less (film)       | Jonah Hill
10 Items or Less (film)       | Morgan Freeman
102 Dalmatians                | Glenn Close
102 Dalmatians                | Tim McInnerny
...
50 rows

The query below returns the 10 actors who have acted in the most movies:

SELECT ?actor (COUNT(?MovieIRI) AS ?num_movies)
FROM <actors>
WHERE {
  ?ActorIRI <ActedIn> ?MovieIRI .
  ?ActorIRI <ActorName> ?actor .
}
GROUP BY ?actor
ORDER BY desc(?num_movies)
LIMIT 10
actor             | num_movies
------------------+------------
Morgan Freeman    |         16
Bruce Willis      |         15
Robert De Niro    |         15
Eddie Murphy      |         14
Donald Sutherland |         14
Richard Dreyfuss  |         13
Steve Buscemi     |         13
Dermot Mulroney   |         13
Julia Roberts     |         13
Dean Martin       |         13
10 rows

Flight Data Example

This example is more complex than the Movie Actors example and uses a flights10k CSV file, which includes a 10,000 row subset of one day's worth of flight data in 2015 from the Department of Transportation. The file includes airport and airline information and focuses on flight delays. For example:

AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,
ELAPSED_TIME,AIR_TIME,DISTANCE...
AS,98,N407AS,ANC,SEA,194,169,1448...
AA,2336,N3KUAA,LAX,PBI,279,263,2330...
US,840,N171US,SFO,CLT,293,266,2296...
AA,258,N3HYAA,LAX,MIA,281,258,2342...
AS,135,N527AS,SEA,ANC,215,199,1448...
DL,806,N3730B,SFO,MSP,230,206,1589...
NK,612,N635NK,LAS,MSP,170,154,1299...
US,2013,N584UW,LAX,CLT,249,228,2125...
...

The following example query loads the data from flights10k.csv, stored on Amazon S3, into a graph named flights. The list below the example query describes each of the components that make up the query.

Open the query below for copying

INSERT { GRAPH <flights> {
  ?OriginIRI  a <Airport> .
  ?DestinationIRI a <Airport>  .
<< ?OriginIRI <DESTINATION> ?DestinationIRI >> <DISTANCE> ?DISTANCE .
  ?FlightIRI a <Flight> ;
    <AIRLINE> ?AIRLINE ;
    <FLIGHT_NUMBER> ?FLIGHT_NUMBER ;
    <TAIL_NUMBER> ?TAIL_NUMBER ;
    <SCHEDULED_DEPARTURE> ?SCHEDULED_DEPARTURE ;
    <DEPARTURE_TIME> ?DEPARTURE_TIME ;
    <DEPARTURE_DELAY> ?DEPARTURE_DELAY ;
    <TAXI_OUT> ?TAXI_OUT ;
    <WHEELS_OFF> ?WHEELS_OFF ;
    <SCHEDULED_TIME> ?SCHEDULED_TIME ;
    <ELAPSED_TIME> ?ELAPSED_TIME ;
    <AIR_TIME> ?AIR_TIME ;
    <WHEELS_ON> ?WHEELS_ON ;
    <TAXI_IN> ?TAXI_IN ;
    <SCHEDULED_ARRIVAL> ?SCHEDULED_ARRIVAL ;
    <ARRIVAL_TIME> ?ARRIVAL_TIME ;
    <ARRIVAL_DELAY> ?ARRIVAL_DELAY ;
    <DIVERTED> ?DIVERTED ;
    <CANCELLED> ?CANCELLED ;
    <CANCELLATION_REASON> ?CANCELLATION_REASON ;
    <AIR_SYSTEM_DELAY> ?AIR_SYSTEM_DELAY ;
    <SECURITY_DELAY> ?SECURITY_DELAY ;
    <AIRLINE_DELAY> ?AIRLINE_DELAY ;
    <LATE_AIRCRAFT_DELAY> ?LATE_AIRCRAFT_DELAY ;
    <WEATHER_DELAY> ?WEATHER_DELAY .
  ?OriginIRI <FlightDeparture> ?FlightIRI .
  ?DestinationIRI <FlightArrival> ?FlightIRI .
  }
}
WHERE { TABLE <s3://csi-notebook-datasets/Flight_Dataset/flights10k.csv>
('csv','global',',',true,'YEAR:char,MONTH:char,DAY:char,DAY_OF_WEEK:char,
AIRLINE:char,FLIGHT_NUMBER:char,TAIL_NUMBER:char,ORIGIN_AIRPORT:char,
DESTINATION_AIRPORT:char,SCHEDULED_DEPARTURE:char,DEPARTURE_TIME:char,
DEPARTURE_DELAY:char,TAXI_OUT:char,WHEELS_OFF:char,SCHEDULED_TIME:char,
ELAPSED_TIME:int,AIR_TIME:int,DISTANCE:int,WHEELS_ON:char,TAXI_IN:char,
SCHEDULED_ARRIVAL:char,ARRIVAL_TIME:char,ARRIVAL_DELAY:char,DIVERTED:char,
CANCELLED:char,CANCELLATION_REASON:char,AIR_SYSTEM_DELAY:char,SECURITY_DELAY:char,
AIRLINE_DELAY:char,LATE_AIRCRAFT_DELAY:char,WEATHER_DELAY:char')
  BIND(IRI(CONCAT(CONCAT("Flight",str(?FLIGHT_NUMBER),str(?TAIL_NUMBER)))) as ?FlightIRI)
  BIND(IRI(str(?ORIGIN_AIRPORT)) as ?OriginIRI)
  BIND(IRI(str(?DESTINATION_AIRPORT)) as ?DestinationIRI)
}

In the example query:

  • The INSERT statement creates a graph named flights that will contain all of the instance data defined by the query:
    INSERT { GRAPH <flights> {
  • The first two triple patterns insert an rdf:type statement to express that OriginIRI and DestinationIRI are instances of the Airport class:
    ?OriginIRI a <Airport> .
    ?DestinationIRI a <Airport> .

    Forming the OriginIRI and DestinationIRI values is defined later in the query and described below.

  • The statement below the first two triple patterns uses RDF* notation to define a property called DISTANCE on the DESTINATION predicate. The values for the DISTANCE column in the CSV file will be inserted as property values (<DISTANCE> ?DISTANCE) on the DESTINATION edges between the OriginIRI and the DestinationIRI.
    << ?OriginIRI <DESTINATION> ?DestinationIRI >> <DISTANCE> ?DISTANCE .

    For more information about properties, see Creating and Querying Labeled Property Graphs (RDF*).

  • The next, large triple pattern specifies that a FlightIRI is an instance of the Flight class, and it creates a FlightIRI <column_name> ?variable_name triple for the rest of the columns in the CSV file:
    ?FlightIRI a <Flight> ;
        <AIRLINE> ?AIRLINE ;
        <FLIGHT_NUMBER> ?FLIGHT_NUMBER ;
        <TAIL_NUMBER> ?TAIL_NUMBER ;
        <SCHEDULED_DEPARTURE> ?SCHEDULED_DEPARTURE ;
    ...

    Forming the FlightIRI value is defined later in the query and described below.

  • At the end of the large triple pattern, the following triples create a relationship between the OriginIRI and the FlightIRI as well as the DestinationIRI and FlightIRI:
    ?OriginIRI <FlightDeparture> ?FlightIRI .
    ?DestinationIRI <FlightArrival> ?FlightIRI .
  • In the TABLE statement in the WHERE clause, the URI that defines the path to the load file shows that the file is hosted on Amazon S3 in the csi-notebook-datasets bucket in the Flight_Dataset directory:
    TABLE <s3://csi-notebook-datasets/Flight_Dataset/flights10k.csv>
  • In the parentheses after the URI, the material type is specified as 'csv'. Since all AnzoGraph nodes have access to S3, the file system is specified as 'global', the delimiter used in the file is comma (','), the file includes headers (true), and the columnMappingString defines the schema for the 31 columns in the file:
    'csv','global',',',true,'YEAR:char,MONTH:char,DAY:char,DAY_OF_WEEK:char,
    AIRLINE:char,FLIGHT_NUMBER:char,TAIL_NUMBER:char,ORIGIN_AIRPORT:char,
    DESTINATION_AIRPORT:char,SCHEDULED_DEPARTURE:char,DEPARTURE_TIME:char,
    DEPARTURE_DELAY:char,TAXI_OUT:char,WHEELS_OFF:char,SCHEDULED_TIME:char,
    ELAPSED_TIME:int,AIR_TIME:int,DISTANCE:int,WHEELS_ON:char,TAXI_IN:char,
    SCHEDULED_ARRIVAL:char,ARRIVAL_TIME:char,ARRIVAL_DELAY:char,DIVERTED:char,
    CANCELLED:char,CANCELLATION_REASON:char,AIR_SYSTEM_DELAY:char,SECURITY_DELAY:char,
    AIRLINE_DELAY:char,LATE_AIRCRAFT_DELAY:char,WEATHER_DELAY:char'

    Even though the INSERT query does not reference every column in the load file, and the data from some columns will not be loaded into the flights graph, the columnMappingString must list all of the columns in the CSV file.

  • The WHERE clause also includes the following BIND statements to define the FlightIRI, OriginIRI, and DestinationIRI values and bind them to variables. To ensure that the FlightIRI values are unique, the IRI is formed by concatenating the string "Flight" with the Flight Number and Tail Number values. The OriginIRI is created by converting the Origin Airport values to IRIs, and the DestinationIRI is created by converting the Destination Airport values to IRIs.
    BIND(IRI(CONCAT(CONCAT("Flight",str(?FLIGHT_NUMBER),str(?TAIL_NUMBER)))) as ?FlightIRI)
    BIND(IRI(str(?ORIGIN_AIRPORT)) as ?OriginIRI)
    BIND(IRI(str(?DESTINATION_AIRPORT)) as ?DestinationIRI)

After loading the data, you can run SPARQL queries to explore the flights graph. For example, this query returns the 20 flights that took the longest time. It also lists the flight distance:

SELECT ?Origin ?Destination ?elapsed_time ?distance
FROM <flights>
WHERE {
  ?Origin <FlightDeparture> ?Flight .
  ?Destination <FlightArrival> ?Flight .
  << ?Origin <DESTINATION> ?Destination >> <DISTANCE> ?distance .
  ?Flight  <ELAPSED_TIME>   ?elapsed_time .
}
ORDER BY DESC(?elapsed_time)
LIMIT 20
Origin | Destination | elapsed_time | distance
-------+-------------+--------------+----------
JFK    | HNL         |          711 |     4983
JFK    | HNL         |          660 |     4983
IAD    | HNL         |          651 |     4817
EWR    | HNL         |          647 |     4962
ATL    | HNL         |          605 |     4502
HNL    | JFK         |          566 |     4983
ORD    | HNL         |          555 |     4243
ORD    | OGG         |          539 |     4184
HNL    | IAD         |          536 |     4817
DFW    | HNL         |          526 |     3784
IAH    | HNL         |          521 |     3904
DFW    | HNL         |          520 |     3784
DFW    | OGG         |          517 |     3711
DFW    | OGG         |          502 |     3711
HNL    | ORD         |          477 |     4243
PHX    | HNL         |          455 |     2917
CLT    | PHX         |          455 |     1773
JFK    | SFO         |          453 |     2586
HNL    | GUM         |          452 |     3801
DEN    | LIH         |          448 |     3414
20 rows

The query below returns a list of the 10 most connected airports by calculating the PageRank for the graph:

SELECT *
FROM <flights>
WHERE { SERVICE <csi:page_rank> {
  [] <csi:binding-vertex> ?airport ;
  <csi:binding-rank> ?rank ;
  <csi:edge-label> <DESTINATION> .
  }
  ?airport a <Airport> .
}
ORDER BY DESC(?rank)
LIMIT 10
airport | rank
--------+-----------
ORD     | 13.412187
DFW     | 13.163151
ATL     | 12.807295
DEN     | 10.381292
IAH     |  8.381458
SLC     |  6.901801
MSP     |  6.243941
SFO     |  5.507264
PHX     |  5.274814
LAX     |  5.245441
10 rows
Related Topics