First published on MSDN on Jul 04, 2017
Guest post by
Microsoft Student Partner at the University of Oxford
Hi, I am currently studying Computer Science at the University of Oxford.
I am the president (2017-18) and was the secretary (2016-17) of the Oxford University Computer Society, and a member of the Oxford University History and Cross Country societies. I also lead a group of Microsoft Student Partners at the university to run talks and hackathons.
F# is an incredibly flexible language, and amongst its many benefits is the ability to use type providers to access and manipulate data from external sources. A type provider supplies .NET types to the compiler at compile time, based on a sample of the data, without the need to declare the type in code - this facility is not dissimilar to LISP's macro features. In F# you might use a type provider in place of code generation, e.g. for writing wrapper types for a database schema. In this article we use a web page to generate a type that we then use for extracting data from other similar pages, and then we look at how to extract data from a CSV file.
So long as you have F# installed, you can follow this guide using any editor, but you can make your experience a little easier by also installing Visual Studio Code with the Ionide plugin. This plugin has several useful features, but the most useful are its IntelliSense and type annotation features, which are even available for types created by a type provider!
Once you're set up, you'll need to install the F# Data package (FSharp.Data) from NuGet:
PM> Install-Package FSharp.Data -Version 2.3.3
Parsing and consuming data from HTML is traditionally a heavy task requiring a large amount of code; often a task as simple as extracting the column names of a table will require dozens of lines of code.
We're going to take a look at a simple problem: each year the cast and crew members of a film will often win several different awards (e.g. Academy Award, Golden Globe), and we would like to find the names of the cast or crew members that won the most awards for that particular film.
To start off with, we'll take a look at the accolades received by Spotlight, 2016's Best Picture winner at the Oscars. The results are presented in a table whose columns include the result of each nomination and its recipient(s) and nominee(s).
Next, we need to use the HTML type provider to create a new type based on this page. Create a new F# script (a file with the .fsx extension) and declare the type there, passing the page's URL to the provider.
Next, we have to request the data for that specific page by loading it with the provided type. The result is an object of the provided type, which has properties such as Tables, Lists and Html - this is standard across all types created by the HTML type provider. However, the properties available under each of these vary with the schema that the type was provided from. In our case, the Tables property has a property for the accolades table, which contains the table data from the page. If you use the Ionide plugin with Visual Studio Code, as described above, you can see this in the IntelliSense suggestions.
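The two steps above (creating the type, then loading the page) might look roughly like the sketch below. The names SpotlightUrl, AccoladePage and spotlightPage are made up for illustration, the URL is my assumption about which page is meant, and the `#r "nuget: ..."` line assumes a recent `dotnet fsi` rather than the package manager command shown earlier. The name of the table property is generated from the page itself, so it is left out here.

```fsharp
// Assumption: a recent F# Interactive that can resolve NuGet packages itself.
#r "nuget: FSharp.Data"
open FSharp.Data

// Assumption: the accolades page referred to in the text lives at this URL.
[<Literal>]
let SpotlightUrl =
    "https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)"

// The provider downloads the page at compile time and generates a type from it.
type AccoladePage = HtmlProvider<SpotlightUrl>

// Load fetches a page at run time and exposes it through the provided type.
let spotlightPage = AccoladePage.Load(SpotlightUrl)

// spotlightPage.Tables now has one property per table found on the page;
// IntelliSense lists the generated names.
```

Because the type is generated from the live page, the script only compiles while the page is reachable.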
Collecting the results together can be done in a few lines of F#. We need to do the following:
Filter out any results that were not wins
Group results by the winner
Count the number of wins for each winner
Sort the winners by number of wins
This can be done as a simple F# function that takes the accolade table as an argument:
let awardNumbers (data: AccoladeData) =
    // Rows holds one strongly typed value per table row
    data.Rows
    |> Seq.filter (fun row -> row.Result = "Won")
    |> Seq.groupBy (fun row -> row.``Recipient(s) and nominee(s)``)
    |> Seq.map (fun (person, awards) -> (person, Seq.length awards))
    |> Seq.sortByDescending (fun (person, count) -> count)
Each table row is also of a type constructed by the type provider, and it will have properties for each column (e.g. the result, the recipient, etc). Finally, we can print the results:
for (person, count) in awardNumbers spotlightData do
    printfn "%s,%d" person count
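To try the pipeline without a network connection, here is a small sketch that runs the same four steps over hand-written rows. The AccoladeRow record and the sample names are made up for illustration; they stand in for the rows that the type provider would supply.

```fsharp
// Hypothetical stand-in for a row of the provided table type.
type AccoladeRow = { Recipient: string; Result: string }

let awardNumbers (rows: seq<AccoladeRow>) =
    rows
    |> Seq.filter (fun row -> row.Result = "Won")                     // keep wins only
    |> Seq.groupBy (fun row -> row.Recipient)                         // group by winner
    |> Seq.map (fun (person, awards) -> (person, Seq.length awards))  // count wins
    |> Seq.sortByDescending snd                                       // most wins first

// Made-up sample data, not taken from any real page.
let sampleRows =
    [ { Recipient = "Person A"; Result = "Won" }
      { Recipient = "Person B"; Result = "Nominated" }
      { Recipient = "Person A"; Result = "Won" }
      { Recipient = "Person B"; Result = "Won" } ]

for (person, count) in awardNumbers sampleRows do
    printfn "%s,%d" person count
```

With these rows the script prints "Person A,2" followed by "Person B,1".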
Whilst this example is interesting for a single page, what about other pages with the same table of data? Simply by changing the URL that we load from we can also print the same results for another film:
Finally, we could then collect this data for several films at once in parallel and then print the results for each film:
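One way to sketch the parallel step is with Async.Parallel from the F# core library. To keep the example runnable offline, loadAwardNumbers below is a hypothetical stand-in that returns fixed, made-up tallies; in a real script it would load the page with the provided type and count the wins. The film titles and URLs are placeholders too.

```fsharp
// Hypothetical stand-in for "load the page at this URL and count the wins".
let loadAwardNumbers (url: string) : (string * int) list =
    [ ("Person A", 2); ("Person B", 1) ]

// Placeholder titles and URLs, not real pages.
let films =
    [ ("Film 1", "https://example.com/film-1")
      ("Film 2", "https://example.com/film-2") ]

let resultsPerFilm =
    films
    |> List.map (fun (title, url) -> async { return (title, loadAwardNumbers url) })
    |> Async.Parallel          // start one computation per film
    |> Async.RunSynchronously  // wait for all of them; results keep the input order

for (title, winners) in resultsPerFilm do
    printfn "%s" title
    for (person, count) in winners do
        printfn "  %s,%d" person count
```

Async.Parallel returns the results as an array in the same order as the input list, so each title stays paired with its own tallies.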
The F# Data package also provides a type provider for CSV files. Much like the HTML provider, it exposes the column names as properties on each row. Here's a simple example that extracts data from the British Government's list of MOT testing stations and counts the stations in each area:
let [<Literal>] MOTUrl =
// No need to specifically declare a type from the type provider if we are
// loading from one source
let data = new CsvProvider<MOTUrl>()
let stationsPerArea =
    // Once again, column headers are the properties
    data.Rows
    |> Seq.groupBy (fun row -> row.``VTS Address Line 4``)
    |> Seq.map (fun (location, rows) -> (location, Seq.length rows))
    |> Seq.sortBy (fun (location, count) -> count)
for (area, count) in stationsPerArea do printfn "%s,%d" area count
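If you do want to read several sources with the same columns, you can declare the type once and reuse it, much as we did with the HTML provider. The sketch below assumes the CSV provider accepts a short inline sample in place of a URL and uses its Parse method on a second piece of text. The column names and rows are made up, and the `#r "nuget: ..."` line again assumes a recent `dotnet fsi`.

```fsharp
// Assumption: a recent F# Interactive that can resolve NuGet packages itself.
#r "nuget: FSharp.Data"
open FSharp.Data

// Made-up inline sample; the header row becomes the property names.
[<Literal>]
let StationSample = "Site,Area\nSite 1,North\nSite 2,South"

type Stations = CsvProvider<StationSample>

// Any text with the same columns can be read through the same provided type.
let otherStations =
    Stations.Parse("Site,Area\nSite 3,North\nSite 4,North\nSite 5,South")

let perArea =
    otherStations.Rows
    |> Seq.countBy (fun row -> row.Area)  // (area, number of stations)
    |> Seq.sortBy snd                     // fewest stations first
    |> List.ofSeq

for (area, count) in perArea do printfn "%s,%d" area count
```

Seq.countBy is a shorthand for the groupBy and map pair used above.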
This is just a small glimpse of what you can do with F# type providers - the F# Data package also includes type providers for JSON files, for example.