Guest post by Thomas Denny, Microsoft Student Partner at the University of Oxford
About Me
Hi, I am currently studying Computer Science at the University of Oxford.
I am the president (2017-18) and was the secretary (2016-17) of the Oxford University Computer Society, and a member of the Oxford University History and Cross Country societies. I also lead a group of Microsoft Student Partners at the university to run talks and hackathons.
F#
F# is an incredibly flexible language, and amongst its many benefits is the ability to use type providers to access and manipulate data from external sources. A type provider generates .NET types at compile time from an external data source, so you do not need to declare those types in code yourself; this facility is not dissimilar to Lisp's macro features. In F# you might use a type provider in place of code generation, e.g. for writing wrapper types for a database schema. In this article we use a web page to generate a type that we then use for extracting data from other similar pages, and then we look at how to extract data from a CSV file.
Getting started
So long as you have F# and NuGet installed you can follow this guide using any editor, but you can make your experience a little easier by also installing Visual Studio Code and the Ionide F# plugin. This plugin has several useful features, but the most useful are its IntelliSense and type annotation features, which are even available for types created by a type provider!
[Image: Visual Studio Code]
Once you're set up, you'll need to install the F# Data package from NuGet:
PM> Install-Package FSharp.Data -Version 2.3.3
Wikipedia tables
Parsing and consuming data from HTML is traditionally a heavy task requiring a large amount of code; often a task as simple as extracting the column names of a table will require dozens of lines of code.
We're going to take a look at a simple problem: each year the cast and crew members of a film will often win several different awards (e.g. Academy Award, Golden Globe), and we would like to find the names of the cast or crew members that won the most awards for that particular film.
To start off with, we'll take a look at the accolades received by Spotlight, 2016's Best Picture winner at the Oscars. The results are presented in a table like this:
[Image: Example table]
First, we need to use the HTML type provider to create a new type based on this page. Create a new file called awards.fsx (an F# script):
#r "FSharp.Data.2.3.3/lib/net40/FSharp.Data.dll"
open FSharp.Data
type AccoladeData = HtmlProvider<"https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)">
Next, we have to request the data for that specific page:
let spotlightData = AccoladeData.Load("https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)")
spotlightData is an object of type AccoladeData, which has the properties Html, Tables, and Lists. These three are standard across all types created by the HTML type provider. However, the members available on each of these properties vary with the page that the type was generated from. In our case, the Tables property has an Accolades property, which contains the table data from the page. If you use the Ionide plugin with Visual Studio Code, as described above, you can see this in the IntelliSense suggestions:
[Image: IntelliSense suggestions]
Collecting the results together can be done in a few lines of F#. We need to do the following:
- Filter out any results that were not wins
- Group results by the winner
- Count the number of wins for each winner
- Sort the winners by number of wins
This can be done as a simple F# function that takes the loaded page data as an argument:
let awardNumbers (data: AccoladeData) =
    data.Tables.Accolades.Rows
    |> Seq.filter (fun row -> row.Result = "Won")
    |> Seq.groupBy (fun row -> row.``Recipient(s) and nominee(s)``)
    |> Seq.map (fun (person, awards) -> (person, Seq.length awards))
    |> Seq.sortByDescending (fun (person, count) -> count)
Each table row is also of a type constructed by the type provider, and it will have properties for each column (e.g. the result, the recipient, etc.). Finally, we can print the results:
for (person, count) in awardNumbers spotlightData do
    printfn "%s,%d" person count
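To try the four steps of awardNumbers without touching the network, the same pipeline can be run over a few in-memory rows. This is only a sketch: the Row record, the sampleRows values and the countWins name are hypothetical stand-ins for the provided row type, and the people listed are made-up placeholders rather than real award data.

```fsharp
// Hypothetical stand-in for the row type that the type provider generates.
// The names and results below are placeholders, not real award data.
type Row = { Recipient: string; Result: string }

let sampleRows =
    [ { Recipient = "Person A"; Result = "Won" }
      { Recipient = "Person B"; Result = "Nominated" }
      { Recipient = "Person A"; Result = "Won" }
      { Recipient = "Person B"; Result = "Won" }
      { Recipient = "Person C"; Result = "Nominated" } ]

// The same four steps as awardNumbers: filter, group, count, sort.
let countWins (rows: seq<Row>) =
    rows
    |> Seq.filter (fun row -> row.Result = "Won")
    |> Seq.groupBy (fun row -> row.Recipient)
    |> Seq.map (fun (person, wins) -> (person, Seq.length wins))
    |> Seq.sortByDescending snd
    |> Seq.toList

let wins = countWins sampleRows

for (person, count) in wins do
    printfn "%s,%d" person count
```

With these placeholder rows, Person A has two wins and Person B has one, while Person C is dropped because neither of the filtered rows belongs to them.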
Whilst this example is interesting for a single page, what about other pages with the same table of data? Simply by changing the URL that we load from we can also print the same results for another film:
let moonlightData = AccoladeData.Load("https://en.wikipedia.org/wiki/List_of_accolades_received_by_Moonlight_(2016_film)")
for (person, count) in awardNumbers moonlightData do
    printfn "%s,%d" person count
Finally, we could then collect this data for several films at once in parallel and then print the results for each film:
let urls = [
    "https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)"
    "https://en.wikipedia.org/wiki/List_of_accolades_received_by_Moonlight_(2016_film)"
    "https://en.wikipedia.org/wiki/List_of_accolades_received_by_La_La_Land_(film)"
]
let allMovies =
    urls
    |> Seq.map AccoladeData.AsyncLoad
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.map awardNumbers
for movie in allMovies do
    for (p,c) in movie do
        printfn "%s,%d" p c
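The loop above prints every film's results one after another without saying which film they belong to. Because Async.Parallel returns its results in the same order as its inputs, each result can be paired back up with the URL it came from. The sketch below illustrates this with a hypothetical fakeLoad function standing in for AccoladeData.AsyncLoad so that it runs offline; the sampleUrls addresses are placeholders, not real pages.

```fsharp
// Hypothetical loader standing in for AccoladeData.AsyncLoad so that the
// sketch runs without network access; it just returns the URL's length.
let fakeLoad (url: string) : Async<int> =
    async { return url.Length }

// Placeholder addresses, not real pages.
let sampleUrls = [ "https://example.com/a"; "https://example.com/bb" ]

let results =
    sampleUrls
    |> Seq.map fakeLoad
    |> Async.Parallel          // Async<int[]>, results in input order
    |> Async.RunSynchronously
    |> Seq.zip sampleUrls      // pair each URL with its own result
    |> Seq.toList

for (url, value) in results do
    printfn "%s -> %d" url value
```

In the real script the same Seq.zip step could be placed after the awardNumbers mapping, so that each film's URL is available as a heading for its results.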
Extracting data from CSVs
The F# Data package also provides a type provider for CSV files. Much like the HTML provider, you can also access all the column names as properties. Here's a simple example that extracts data from the British Government's list of MOT testing stations:
let [<Literal>] MOTUrl =
    "https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/613984/active-mot-testing-stations.csv"
// No need to specifically declare a type from the type provider if we are
// loading from one source
let data = new CsvProvider<MOTUrl>()
let stationsPerArea =
    data.Rows
    // Once again, column headers are the properties
    |> Seq.groupBy (fun row -> row.``VTS Address Line 4``)
    |> Seq.map (fun (location, rows) -> (location, Seq.length rows))
    |> Seq.sortBy (fun (location, count) -> count)
for (area, count) in stationsPerArea do printfn "%s,%d" area count
Conclusion
This is just a small glimpse of what you can do with F# type providers. The F# Data package also includes type providers for JSON files, for example.
Extra reading
- Type Providers in the F# guide
- Try F# Online and learn more about F# at
- Microsoft Research F#: https://www.microsoft.com/en-us/research/project/f-at-microsoft-research/#