Guest post by Nathan Lucaussy, Microsoft Student Partner at Oxford University.
Hello! I hope you'll enjoy the following blog post - it details the particular kind of algorithm that got me so excited about studying Computer Science in the first place! I'm Nathan, a second-year student reading Computer Science and Philosophy at the University of Oxford. I'm mainly interested in Machine Learning, Algorithms and Concrete AI Safety (ensuring AI algorithms do as they're told) - but in my spare time I enjoy playing the guitar and travelling. Reach me on LinkedIn:
https://www.linkedin.com/in/nathan-lucaussy-60a16816a/
In this blog post we'll be getting used to F#'s functional style before deploying it for some data analysis in Azure Notebooks: our main task will be modelling the temperature of London over 3 years. We'll start off with
plotting a time series and cleaning the dataset.
This will introduce us to
In the second half of this post, we will look at devising a Genetic Algorithm for fitting a sinusoidal curve to temperature data. By devising this regression algorithm, we'll be doing some
Data Science
, but we'll also be looking at important F# features:
We will do all of this using Azure Notebooks (in fact, in this very notebook!). Azure Notebooks is a free, online platform in the Microsoft Cloud providing an interactive development environment in F# but also Python and R. Its interactivity is particularly useful for data analysis, allowing instant visualisation of results.
The most widely used library for data science in F# is FsLab. It provides a number of packages, of which we will use:
Azure Notebooks natively supports Paket (the dependency manager for .NET's - and by extension F#'s - package repository NuGet). Follow the steps below to load the required packages directly from the NuGet repository:
1) #load "Paket.fsx" enables Paket within the Azure Notebooks environment.
2) Paket.Dependencies.Install """ ... """ This adds dependencies from the Nuget repository
3) Paket.Package ["FsLab"] generates dependencies for the downloaded packages
4) #load "Paket.Generated.Refs.fsx" to perform the actual referencing
Finally, we use the "open" keyword to open a namespace to the Notebook environment - much like Python's import.
#load "Paket.fsx" Paket.Dependencies.Install """ frameworks: net45 source https: //nuget.org/api/v2 nuget FSharp.Data nuget XPlot.Plotly """ Paket.Package["XPlot.Plotly" "FSharp.Data"]#load "XPlot.Plotly.Paket.fsx"#load "XPlot.Plotly.fsx"#load "Paket.Generated.Refs.fsx" open System open FSharp.Data open XPlot.Plotly A.1 Preparing and cleaning the data
We now need to load in the weather data from the file Condition_Sunrise.csv. This is the data we will want to perform our analytics on. This is where F# really shines - F# offers
type providers
: an extremely efficient way to parse data and metadata from structured sources to an F#-intelligible schema. We make use of the CSV type provider.
The following type declaration:
1. type Weather = CsvProvider < "/home/nbuser/library/Condition_Sunrise.csv" >
F# specific:
We are introduced to another of F#'s functional features:
let bindings
. In imperative languages, variables are bound to a memory address in which a value is placed - this may then be changed with another value. In F# let bindings bind an identifier with a value or function - they are
immutable
. Related to this idea is the fact that when used functionally F# treats instructions as expressions. That there is no real notion of state in a functional program makes it so much simpler to reason about program semantics.
We may now load the data from our CSV file - this is done by loading into an instance of the type given by the type provider:
let my_data = Weather.Load("/home/nbuser/library/Condition_Sunrise.csv")The object created is a schema which may be converted to an iterable sequence by calling the function Rows
F# specific:
Notice the |> infix operator (pronounced pipe forward). It passes its first argument as an argument to it's second argument, a another function. This operator makes a huge difference with regards to readability of code, especially when performing data transforms on arrays.
Azure Notebooks' interactivity means we can read the first row of our CSV data. Immediately, we observe that some of the data will be of no use to us - we only need the data in the first two columns: DateTime and Temp.
let first_row = my_data.Rows | >; Seq.head first_row Out: ("December 12, 2012 at 07:07AM", 29, "Partly Cloudy", 46, 30)To select the required data we use an array comprehension, creating pairs of elements comprising of only the date and temperature.
let data_array = [ | for row in my_data.Rows - >; (row.DateTime, row.Temp) | ]
We can lighten the array even further. Because the first column gives time at sunrise and second column represents the temperature at sunrise, we remove the time from each string of time.
To do this, we split the partition the string before and after the keyword 'at', and take the initial portion of the split string, using the Array.head function.
let removeTimeFromDateString(str: string) = str.Split([ | " at " | ], StringSplitOptions.None) | >; Array.head
F# specific:
We now have an array of tuples all of which have a string as a first element. But this string represents a date! How do we parse it as machine-understandable time? Here F#'s
.NET Framework
integration proves very useful: the DateTime package provides a function
Parse
that correctly parses our date format. ToOADate then converts DateTime objects into a numerical date format, to which we subtract the first date to make numbers more manageable i.e. starting from 0.
F# specific:
In functional languages, Higher-Order Functions are prevalent. These are functions that either take in functions as arguments or return functions given arguments (or both). We have already met one: |>
(pipe forward)
. Note below the use of Array.map - it applies a function to every single element of an array.
_F# specific _ Lambdas, or anonymous functions, are often used in functional languages. They act like regular functions except they are unnamed - the syntax for defining lambdas in F# is: fun x -> 2*x, as an example. Below an anonymous function is used as argument to Array.map
let pruned_array = Array.map(fun(x: string, y: int) - >; ((x | > removeTimeFromDateString | > System.DateTime.Parse).ToOADate() - 41255.0, y)) data_array let date_values = pruned_array | >; Array.map fst let temp_values = pruned_array | >; Array.map snd A.2 Plotting temperatures as a function of time using XPlot.Plotly
Since XPlot.Plotly's namespace is open to our environment, we can now create XPlot objects. We choose to create a Scatter object (corresponding to the data organisation of a scatter-plot) because we have more than 1000 data points and a histogram, for example, would hinder clarity.
Note that the x and y series are passed in as arrays.
let trace1 = Scatter(x = (pruned_array | >; Array.map fst), y = (pruned_array | > Array.map snd), name = "Temperatures")Azure Notebooks allows for wonderful inline plotting:
trace1 | > Chart.Plot
With these basics in place, we can start the regression process. Ultimately, we aim to provide a line of best fit giving temp_values as a function of date_values
The genetic algorithm is used to solve
optimisation problems
by mimicking the process of
evolution
. Given a candidate solution - an
Individual
's particular characteristics (usually called
Traits
), we may generate a
Population of Individuals
each differing slightly from the source Individual in its Traits through randomnness. Having devised a way of
ranking individuals
(
Fitness
) in a Population, we perform a crossover of the
best Individuals
with the rest of the
Population
, adding in some
randomness
to escape local efficiency maxima. Each generation improves on the previous one, hence approximating an optimal solution.
For a graphical illustration of the Genetic Algorithm process, the following video, where a genetic algorithm learns to walk, may be of interest: https://youtu.be/xcIBoPuNIiw
· In our case, the individuals are four-tuples of Double values, for which we create a type Individual: they correspond to the values
(a,b,c,d)
in the family of the functions of the form a×(sin(b×x+c))+d. Such tuples capture all possible sinusoidal functions.
· We devise additional types : a Population will be a list of Individuals, Parents a pair of Individuals
When devising this large piece of code, which will ultimately run as a single function, we will use a design heuristic called wholemeal programming : never once will we look at the individual data contained within the data arrays or the list of individuals. Instead, we will repeatedly apply functions to the whole of the population. This kind of programming is distinctly functional.
type Individual = double * double * double * double type Population = Individual list type Parents = Individual * IndividualCrucial to the genetic algorithm is a function that inserts randomness at various stages of the process - the following adds or removes up to 10% of a Double value:
let addTenPercRandom(random_gen: Random)(x: double): double = x * ((double(random_gen.Next(-100, 100)) / 1000.) + 1.)We also need a higher-order function that applies a function _f_ to every element of our 4-tuple individuals. You might recognise this as a map , and indeed it is the natural map on the tuple structure.
let tupleMap(f: Double - > Double)(w, x, y, z) = (f w, f x, f y, f z) B.2 Building the initial population
Because Genetic Algorithms are prone to getting stuck at local maxima, it is often useful to introduce a guess for the starting individual. We build our first population's generation around this individual.
F# Specific:
When writing programs in a functional style, we aim to avoid using loops (indeed, in purely functional languages like Haskell, it is very hard, near impossible, to do so). The functional alternative is using recursion. The
let rec
keyword instructs the compiler that this is a recursive function. We pass a parameter
count
which is decreased at each iteration, until we reach a base case, 0.
F# Specific: To handle the base case and recursive cases differently, we use another distinctively functional feature: pattern matching. The match keyword compares the value of size with 0 or any other integer, defaulting to the empty list in the 0 case and building the list recursively when non-zero using the :: (cons) operator - it appends a first element to a list.
let makeIndividual(random_gen: Random)(guess_tuple: Individual): Individual = (tupleMap(addTenPercRandom random_gen) guess_tuple) let rec makePopulation(random_gen: Random)(size: int)(guess_tuple: Individual): Population = match size with | 0 - >; List.empty | n - >; (makeIndividual random_gen guess_tuple):: makePopulation random_gen(size - 1) guess_tuple B.3 Evaluating the fitness of an individual
The second ingredient of the Genetic Algorithm is a function that evaluates the performance of an individual at the given task - called a
fitness function
.
In our case, it consists in approximating the temperature values - for this we create a function which calculates the image of the sine function for a given date and given individual. This is the purpose of
findSinValue
.
Fitness will be a measure of statistical squares: the sum of the squares of the differences between the simulated value and the actual temperature value, for each value in the temperature value array.
We define the type Result as a record of the Individual and the Fitness of that given individual - so that for clarity they are paired up.
The function
simulate
is defined in a very functional way:
Given a previously generated population, how do obtain a new, improved generation?
We devise a mechanism for crossing-over two individuals' traits. For balance, each parent gives half of the traits - there are thus 6 ways to arrange the traits. The
merge
function gives one of these ways depending on the number passed to it as argument.
The
crossOver
function selects a random way to merge parents' traits by passing one of six random numbers to the
merge
function.
We can now crossover individuals. For a whole population, we first extract the top-ranking half of the population:
This is done by composing the three functions List.sortBy, List.take, List.map.
From the top half we extract the best two individuals:
This process yields a new generation that is at least as good as the previous one.
1. let rng = Random() 1. let merge n(a, b, c, d)(a ',b', c ',d') = 2. match n with 3. | 0 - >; (a, b, c ',d') 4. | 1 - >; (a ',b', c, d) 5. | 2 - >; (a ',b,c', d) 6. | 3 - >; (a, b ',c,d') 7. | 4 - >; (a ',b,c,d') 8. | 5 - >; (a, b ',c', d) 9. | _ - >; raise(System.ArgumentException("There are only six cases!")10. ) 11. 1. let crossOver(parents: Parents): Individual = 2. let randomCrossingOrder = rng.Next(6) 3. merge randomCrossingOrder(parents | >; fst)(parents | > snd) 4. 5. let generateNextPopulation(mutatePopulation: Population - >; Population)(crossOver: Parents - > Individual)(results_generation: Result list): Population = 6. let best_individuals = results_generation 7. |>; List.sortBy(fun result - > result.Fitness) 8. //sort list in order of ascending mean squares 9. |>; List.take((results_generation.Length + 2) / 2) 10. //take the best half of the generation 11. |>; List.map(fun result - > result.IndividualTested) 12. //retrieve indivudals from elements of type Result record 13. best_individuals 14. |>; function | head::second::tail - > 15. (head::second::16. mutatePopulation17. ([18. tail | >; List.map(fun individual - > crossOver(head, individual));19. tail | >; List.map(fun individual - > crossOver(second, individual))20. ] | >; List.concat)) 21. | _ - >; 22. raise(System.ArgumentException("Population not large enough to crossover!")) B.5 Repeating the simulation over 100 generations
Before running the evolution 100 times, we need to set starting parameters:
· As population size, we choose to have 1000 individuals so that there is sufficient variety of traits, including outliers which will avoid local maxima.
· As starting guess, we use basic mathematical properties to find start values by inspection:
a
is the amplitude of the periodical pattern,
b
is 2π÷period2π÷period(since this is a yearly phenomenon, we estimate the period to be approximately 365 days),
c
is the phase shift and
d
is the vertical shift.
· the repeatSimulation function is a natural candidate for recursion, taking the new generation's population each time. We obtain fitness for each individual using the previously-defined
simulate
function. We find the best individual via minimum squares-sum at each generation; for verbosity purposes this is printed each time. Finally, when we reach the last generation, the best individual is returned.
Notice how this last function uses only previously defined functions to manipulate data without ever looking at it. This is 'wholemeal programming'.
1. let starting_guess = (30., 0.015, (-20.), 50.) let starting_size = 1000 2. let starting_population = makePopulation rng starting_size starting_guess 3. let mutation(pop: Population): Population = pop | >; List.map(tupleMap(addTenPercRandom rng)) 4. let rec repeatSimulation max round_number pop = 5. match round_number with 6. | count when count = max - >; pop | > List.head 7. | count - >; 8. let generation_results = pop | > List.map simulate 9. let best = generation_results | >; List.minBy(fun result - > result.Fitness) 10. printfn "Best fitness in this generation: %A" 11. best.Fitness 12. let new_generation = generation_results 13. | >; generateNextPopulation mutation crossOver new_generation 14. | >; repeatSimulation max(count + 1) 1. repeatSimulation 100 0 starting_population yields: Best fitness in first generation: 155707.3092Best fitness in 100th generation: 92311.57632
We obtain a close solution: the sum of squares is 92048.62, which translates to a Root-Mean-Square of approximately 7.5. Given how the data has high variance with respect to a least-mean-squares fit of a sinusoidal curve, this is a very good result.
To illustrate the fit, let's go back to the original graph. We can compute the values for our sinusoidal model and overlay it on the Plotly Chart:
It is safe to say we have observed some of F#'s most tremendous capabilities, notably through it's interoperability with the .NET Framework, Type Providers (FSharp.Data - but they also give access to R and Python packages) and strong type system. These features make it particularly suited for data analysis. Note that the Genetic Algorithm is a fun way to learn about functional programming because it is simple to understand; however it is not very efficient. Watch this space for presentations of more efficient Machine Learning algorithms in F# using Azure Notebooks!
Try this in Azure NotebooksAzure Notebook Version:
https://notebooks.azure.com/anon-ioqeiw/libraries/FSharpAzureDataScience/html/Intro_to_FSharp_A...
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.