Importance and goals of filtering GTFS
This article has two goals. The first is to introduce software developers to GTFS feeds, their structure, and the relations between their files. The second, and the main one, is to explain how GTFS files can be filtered.
GTFS stands for General Transit Feed Specification. GTFS feeds are widely used datasets that describe the transit data of public transport agencies. GTFS “feeds” allow public transit agencies to publish their transit data and developers to write applications that consume that data in an interoperable way.
There are several reasons why GTFS data is used by so many applications today. GTFS was originally developed by Google in cooperation with the TriMet transit agency and is still supported by Google, one of the biggest and most stable companies in the world, so we can presume that GTFS will be used, supported, and developed for many years. That makes it worth working with. Another reason is that many applications already work with GTFS, such as Google Maps, Bing Maps, OpenTripPlanner, Graphserver, OneBusAway, IVR, and many more.
It is also worth mentioning that GTFS feeds are easy to acquire: a feed can be downloaded as a single zip file. A list of agencies that provide GTFS feeds can be found at the end of this article.
From the beginning, GTFS was designed to be simple, so that small agencies could easily adopt the standard. For this reason, GTFS uses comma-separated values (CSV) files.
As already mentioned, a GTFS feed is a compressed zip file containing CSV files that describe a transit system. An online reference that explains in more detail all of the file types that can be included in a GTFS feed is linked at the end of the article. That reference provides details on the files and fields; the following table includes only the required files for a GTFS feed.
|File|Description|
|---|---|
|agency.txt|Data on the transit agency providing the feed.|
|calendar.txt|A schedule of when a trip is active.|
|routes.txt|This defines the many transit routes available.|
|stop_times.txt|The actual times associated with a stop in stops.txt.|
|stops.txt|This specifies individual stops a certain transit run makes.|
|trips.txt|This defines individual trips or runs within a route.|
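Because a feed is just a zip archive of CSV files, checking whether it contains the required files takes only a few lines. The following is a minimal Python sketch, not part of the original project. The function name `missing_required_files` is hypothetical, and the sketch assumes the files sit at the top level of the archive rather than in a subfolder.

```python
import io
import zipfile

# Required files, as listed in the table above.
REQUIRED_FILES = {
    "agency.txt", "calendar.txt", "routes.txt",
    "stop_times.txt", "stops.txt", "trips.txt",
}

def missing_required_files(feed):
    """Return the required GTFS files absent from a feed zip.

    `feed` may be a path or a binary file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(feed) as archive:
        present = set(archive.namelist())
    return sorted(REQUIRED_FILES - present)

# Demo with a tiny in-memory "feed" that holds only two of the six files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("agency.txt", "agency_name,agency_url,agency_timezone\n")
    archive.writestr("routes.txt", "route_id,route_short_name,route_long_name,route_type\n")
buffer.seek(0)

print(missing_required_files(buffer))
# → ['calendar.txt', 'stop_times.txt', 'stops.txt', 'trips.txt']
```

For a downloaded feed, the same function could be called with the path to the zip file instead of the in-memory buffer.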
The following ER diagram shows the relationships between each of the required GTFS files.
The diagram above illustrates the relationships between each of the files and shows that trips and routes are the core entities. Because this article later focuses on filtering the GTFS files by route type, we will concentrate on the route entity.
The most important objective of this article is to show how to extract only the data an application needs. In the next part, the routes.txt file is explained in detail. This article assumes that the application filters GTFS files by the type of transport used.
The main field analyzed in this article is the “route_type” field of the routes file. It is the entry point for filtering the dataset. The routes file has the following fields:
|Field|Required|Description|
|---|---|---|
|route_id|Required|Uniquely identifies a route|
|agency_id|Optional|Value which is referenced from the agency.txt file|
|route_short_name|Required|The short name of a route|
|route_long_name|Required|Full name of a route|
|route_desc|Optional|Description of a route|
|route_url|Optional|URL of a web page about that route|
|route_color|Optional|Color that corresponds to a route|
|route_text_color|Optional|A legible color to use for text drawn against a background of route_color|
|route_type|Required|Describes the type of transportation used on a route. The valid values are listed after this table.|
Valid values for the route_type field are:
0 – Tram, Streetcar, Light rail. Any light rail or street level system within a metropolitan area.
1 – Subway, Metro. Any underground rail system within a metropolitan area.
2 – Rail. Used for intercity or long-distance travel.
3 – Bus. Used for short- and long-distance bus routes.
4 – Ferry. Used for short- and long-distance boat service.
5 – Cable car. Used for street-level cable cars where the cable runs beneath the car.
6 – Gondola, Suspended cable car. Typically used for aerial cable cars where the car is suspended from the cable.
7 – Funicular. Any rail system designed for steep inclines.
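In code, the eight basic route_type values can be modeled as an enumeration, which makes the filter criterion readable. The sketch below is illustrative only: the names `RouteType` and `parse_route_type` are hypothetical, and any value outside 0 to 7 (an extended route type, for example) is simply reported as unknown.

```python
from enum import IntEnum

class RouteType(IntEnum):
    """The basic GTFS route_type values listed above (hypothetical helper type)."""
    TRAM = 0
    SUBWAY = 1
    RAIL = 2
    BUS = 3
    FERRY = 4
    CABLE_CAR = 5
    GONDOLA = 6
    FUNICULAR = 7

def parse_route_type(raw):
    """Convert the raw CSV string to a RouteType.

    Returns None when the text is not a number or lies outside 0-7.
    """
    try:
        return RouteType(int(raw))
    except ValueError:
        return None

print(parse_route_type("3").name)  # → BUS
print(parse_route_type("700"))     # → None
```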
Beyond these basic values, there are also Extended GTFS Route Types. That topic is large, and this article does not cover the extended route types. More information and discussions about the Extended GTFS Route Types can be found via the link provided at the end of the article.
Filtering GTFS files
After this short introduction to GTFS files and their structure, we will explain the process of filtering the GTFS files by route type, which, as we now know, can be one of the following values: Tram, Subway, Rail, Bus, Ferry, Cable car, Gondola, and Funicular.
Filtering the GTFS files serves several goals. Every software developer should care about memory usage and the efficient use of resources. The first goal of filtering is to reduce the size of the GTFS files and the space they consume. The second goal is to remove unnecessary data, so that only the data the application actually works with remains. These two goals lead to the main reason for filtering GTFS data: speed and optimization.
Take OpenTripPlanner, an open source multi-modal trip planner, as an example. It uses imported GTFS files to plan multi-modal walking, wheelchair, bicycle, and transit trips, and it exposes a web services API that other apps or front-ends can build on. Although OpenTripPlanner is already fast at calculating a multi-modal trip, it can be slightly faster when the feed is filtered by route type (transit mode). If an application plans trips that use only buses, the GTFS files need to be filtered by the “3 – Bus” route type. As a result, the application does not carry unnecessary data for trams, subways, and other modes, with their routes, trips, shapes, stops, stop times, and so on, but only the data needed for buses.
The GTFS files are filtered by route type in several steps, as shown in the flow diagram below.
The first step is to filter the routes file by route type and collect the route ids of the matching routes. Next, the trips file is filtered with those route ids, which yields the two filter variables for the following steps: the trip ids and the shape ids. The shapes file is filtered using the shape ids. The stop_times file is filtered using the trip ids, and the stop ids are extracted from the result; they are needed for the last step of the process. The last step is to filter the stops file using the stop ids. After each filtering step, the new filtered file is saved to the file system. A Java project with this implementation can be found in the GitHub repository listed in the references.
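The steps above can be sketched in a few dozen lines of Python. This is not the Java implementation from the repository, only an illustration of the same sequence under some assumptions: the feed is already unzipped into a directory, each file fits in memory, and the helper names (`filter_file`, `filter_feed_by_route_type`, `write_demo_feed`) as well as the demo data are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def filter_file(src_dir, dst_dir, name, column, allowed):
    """Write the rows of src_dir/name whose `column` is in `allowed` to dst_dir/name.

    Returns the kept rows, or [] if the file is absent (shapes.txt is optional).
    """
    src = Path(src_dir) / name
    if not src.exists():
        return []
    with src.open(newline="", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = [row for row in reader if row.get(column) in allowed]
    with (Path(dst_dir) / name).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return rows

def filter_feed_by_route_type(src_dir, dst_dir, route_type):
    """Filter routes -> trips -> shapes -> stop_times -> stops, saving each file."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    routes = filter_file(src_dir, dst_dir, "routes.txt", "route_type", {str(route_type)})
    route_ids = {r["route_id"] for r in routes}
    trips = filter_file(src_dir, dst_dir, "trips.txt", "route_id", route_ids)
    trip_ids = {t["trip_id"] for t in trips}
    shape_ids = {t["shape_id"] for t in trips if t.get("shape_id")}
    filter_file(src_dir, dst_dir, "shapes.txt", "shape_id", shape_ids)
    stop_times = filter_file(src_dir, dst_dir, "stop_times.txt", "trip_id", trip_ids)
    stop_ids = {s["stop_id"] for s in stop_times}
    stops = filter_file(src_dir, dst_dir, "stops.txt", "stop_id", stop_ids)
    return {"routes": len(routes), "trips": len(trips),
            "stop_times": len(stop_times), "stops": len(stops)}

def write_demo_feed(directory):
    """Create a tiny made-up feed: one bus route (R1) and one tram route (R2)."""
    files = {
        "routes.txt": "route_id,route_short_name,route_long_name,route_type\n"
                      "R1,10,Bus line 10,3\nR2,T1,Tram line 1,0\n",
        "trips.txt": "route_id,service_id,trip_id,shape_id\n"
                     "R1,WK,TR1,SH1\nR2,WK,TR2,SH2\n",
        "shapes.txt": "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence\n"
                      "SH1,41.99,21.43,1\nSH2,42.00,21.44,1\n",
        "stop_times.txt": "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n"
                          "TR1,08:00:00,08:00:00,S1,1\nTR1,08:05:00,08:05:00,S2,2\n"
                          "TR2,09:00:00,09:00:00,S3,1\n",
        "stops.txt": "stop_id,stop_name,stop_lat,stop_lon\n"
                     "S1,First,41.99,21.43\nS2,Second,41.98,21.42\nS3,Third,42.00,21.44\n",
    }
    for name, text in files.items():
        (Path(directory) / name).write_text(text, encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "feed", Path(tmp) / "bus_only"
    src.mkdir()
    write_demo_feed(src)
    counts = filter_feed_by_route_type(src, dst, 3)  # 3 = Bus
    print(counts)
    # → {'routes': 1, 'trips': 1, 'stop_times': 2, 'stops': 2}
```

The sketch leaves agency.txt and calendar.txt untouched, since the steps described here do not filter them. For a large feed, stop_times.txt can be far bigger than the other files, so a streaming variant that writes matching rows as it reads them would use less memory.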
GTFS-filter GitHub repository
Agencies who provide GTFS feeds:
Open Trip Planner
Online reference of GTFS files and fields
Extended GTFS Route Types