Findmypast Tech

GRAPHical Family Trees

Reading time: 6 min

This post uses a graph database to show your family tree to you in a new way and shows you how to query it too. It serves as a basic introduction to Graph Databases and shows what a powerful tool they are.

What are we doing?

At Findmypast we have a team of engineers working on Search. We recently had the chance to use Neo4J (a graph database) to do some interesting work. It’s worth sharing what Graph Dbs are, what they can do, and some of the fun you can have with them.

If you want to play along, there’s a github repo for those in FMP, or a zip file containing everything you need to get started.

Interlude…

First, what’s a graph database? What’s a graph? This is a graph isn’t it?

No. This is not a graph. Well, not for the purposes of this blog-post anyway. This is perhaps more technically known as a chart. Charts might have X and Y axes, or segments of pies, or pretty bars and wiggly lines… But these are not GRAPHS.

A Graph could be defined as being made up of vertices which are connected by edges. A graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge, or its edges may be directed from one vertex to another.

Fascinating stuff, I’m sure you’ll agree, but let me put that another way.

A graph is a group of nodes connected by relationships.

All of a sudden this starts to sound a bit like a family tree where the nodes represent people and the relationships represent, er, the relationships between them.

Back to the story. Getting started.

If you want to try this as well, there are a few hoops to jump through first. If you just want to skip to more pretty pictures, go to the next section. :-)

Prerequisites

You’ll need Docker for this preferably running on Ubuntu (otherwise you’ll have to work out the pathnames stuff yourself)

You’ll also need a few other things installed:

  • Clone the github repo or download the zip
  • create a directory called tmp at the same level as Neo4J
  • Ruby: sudo apt-get install ruby
  • Ruby-dev: sudo apt-get install ruby-dev
  • Some gems: sudo gem install gedcom and sudo gem install neography
  • A gedcom file: Gedcom files are a structured text document format that can contain people and relationships from a family tree. From Findmypast you can export your family tree as a gedcom file by visiting the view all trees page and clicking the middle export tree icon next to your tree.

    If you don’t have a family tree, you can use mine or use the sample Harry Potter one in the directory (wizard!).

I think that is everything!

Setting up.

First, you need to convert your gedcom file to a format that can be imported into a graph database. Run this:

cd neo4jfmp/import/gedcom/

./gedcom2csv.rb <Name of you gedcom file without the .ged extension>

My ruby-fu is not great and there could be bugs here with your gedcom. Here be dragons - You have been warned.

cd ../../../

./startneo.sh /media/psf/AllFiles/Users/aclark/Projects/Docker/tmp/

That last command will need the path to your tmp folder, not mine…

And it might take a while to run first time.

You should now have an empty graph database waiting for you! Go to the Neo4J database page and you should see something like this:

OK, let’s import your tree…

In the input box at the top, paste the following (you’ll find this snippet in Neo4J/neo4jfmp/cypher/import_persons.cyp file along with all the other snippets we’ll run)

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:////gedcomImportPersons.csv' as record FIELDTERMINATOR '|'
WITH record
CREATE (person:PERSON {gedcomId: record.gedcomId})
WITH person, record
SET
      person.gender = record.gender,
      person.label = record.label,
      person.firstName = record.firstName,
      person.lastName = record.lastName,
      person.yearOfBirth = toInteger(trim(record.yearOfBirth)),
      person.yearOfDeath = toInteger(trim(record.yearOfDeath)),
      person.birthPlace = record.birthPlace,
      person.deathPlace = record.deathPlace
  WITH person, record, CASE WHEN record.gender = "M" THEN [1] ELSE [] END AS Males
  FOREACH (x IN Males | SET person:M)
  WITH person, record, CASE WHEN record.gender = "F" THEN [1] ELSE [] END AS Females
  FOREACH (x IN Females | SET person:F)
  WITH person, record, CASE WHEN record.gender = "U" THEN [1] ELSE [] END AS Unknowns
  FOREACH (x IN Unknowns | SET person:U);

and run it. Once that finishes, run this (import_families.cyp):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:////gedcomImportFamilies.csv' as record FIELDTERMINATOR '|'
WITH record
MATCH (start_person {gedcomId: record.START_ID}), (end_person {gedcomId: record.END_ID})
WITH start_person, end_person, record
FOREACH(ignoreMe IN CASE WHEN record.TYPE = "CHILD" THEN [1] ELSE [] END |
	CREATE (start_person)-[:CHILD {empty_label: ''}]->(end_person)
)
FOREACH(ignoreMe IN CASE WHEN record.TYPE = "SPOUSE" THEN [1] ELSE [] END |
	CREATE (start_person)-[:SPOUSE {empty_label: ''}]->(end_person)
)
FOREACH(ignoreMe IN CASE WHEN NOT record.TYPE = "SPOUSE" AND NOT record.TYPE = "CHILD" THEN [1] ELSE [] END |
	CREATE (start_person)-[:RELATED_TO {empty_label: ''}]->(end_person)
);

Now this (create_root.cyp). Change the name to be the root person of the tree (probably you!):

MATCH (you {firstName:"Alexander", lastName:"Clark"})
WITH you
LIMIT 1
SET you:ROOT;

All done! Now you’re ready to do the fun part. But first! Let’s check everything has worked and you can see your tree.

MATCH (n:ROOT)
RETURN n;

This should pop up a little circle. Double click it and more little circles appear.

You can click on each node and see its properties at the bottom.

If you don’t like the colours, click on the star icon on the far left of the screen and drag Neo4J/import/gedcom/style.grass onto the dotted box.

The yellow relationships represent parent-child reclationships and the green ones represent spouse relationships.

OK. Let’s start with something basic. Like showing everyone in your tree!

Click the cog icon at the bottom left of the screen and change Initial node display to 1000

Now run this:

MATCH (n)
RETURN n;

Boing! up pops everybody. You can make the window fullscreen with the diagonal arrows in the top right hand corner of the result pane. There’s a zoom bottom right to let you see everyone at once:

OK, that’s silly and doesn’t tell us anything.

How about the path between the youngest person and the oldest person in the tree? (These are in queries.cyp)

MATCH (youngest)
WITH youngest
ORDER BY 0 - youngest.yearOfBirth LIMIT 1
MATCH p = shortestPath((oldest)-[:CHILD *..25]-(youngest))
WHERE oldest.gedcomId <> youngest.gedcomId
WITH nodes(p) as rels
ORDER BY oldest.yearOfBirth LIMIT 1
UNWIND rels AS rel
OPTIONAL MATCH r = (rel)-[:SPOUSE]-()
RETURN rel, r;

At the bottom of the image, my daughter Eloise was born in 2011.

At the top, her great-great-great-great-great-grandfather was James Thorpe born in 1778 :-)

There’s quite a lot going on in this query. I’ll break some of it down:

In cypher (the language for querying Neo4J) you get data out by using the MATCH...RETURN query. We start with

MATCH (a_variable_name_you_make_up:A_LABEL_FROM_THE_DB)

then

RETURN a_variable_name_you_make_up;

There are ORDER BY and WHERE clauses to re-order and filter our query. The label you specify can also filter the results. You can use F as a label to get all the females for example.

Cypher works like a pipeline of data. You start with a lot of records at the start and then filter or group it, passing it onto the next stage of the pipeline with the WITH clause, after which you can run more MATCHing and filtering / grouping.

Graphs work by pattern matching an input against one or more nodes and relationships. You can give it a pattern and ask it to find that pattern in the graph.

In the above example, I’ve used the shortestPath function to tell me the quickest way to get from one node to another in the graph.

Nodes and relationships between nodes are represented in ASCII:

(var_node)-[:RELATIONSHIP_LABEL *..HopsNumber]->(another_var_node)

shortestPath() takes this relationship pattern, looks for all occurrences of it in the graph, and the returns the one which has the fewest hops form the start to the end.

How about your most distant relative?

MATCH (n {firstName:"Alexander", lastName:"Clark"})
WITH n
MATCH p = shortestPath((a)-[*1..1000]-(n))
WHERE a.gedcomId <> n.gedcomId
WITH p
WHERE length(p) = 8
RETURN p;

(Your mileage with this one may vary. You’ll need to change the name for your tree, and change the 8, increasing it until you don’t get a match. There’s probably a better way of finding the longestPath, but this seemed reasonably clear)

I’ve used a shorthand in the query of putting the properties I’m filtering by into the MATCH clause, rather than use a WHERE clause.

So Jane Thorpe at the top is my most distant relative that I know about at the moment.

Most popular first names in your tree?

MATCH (n)
RETURN n.firstName, count(n.firstName) as occurances
ORDER BY occurances DESC
LIMIT 10;

Here, data comes back as a table rather than nodes as I asked for a count aggregate and just the firstName property.

John is fairly predictably the most common, but there’s an Isabella in the list which is similar to my daughter’s name Isobel and my name is in the top ten.

Final challenge.

My daughter asked me how often her name appears in our tree. Simple, I thought:

MATCH (n)
WHERE n.firstName = "Isobel"
RETURN n;

That should give me all the bouncing balls with her name.

Oh, only one:

But I know there are others! Oh, they spell it differently (or it’s a variation)

MATCH (n)
WHERE n.firstName = "Isabella"
RETURN n;

But what if I hadn’t known that or there were misspellings in my tree? Because Neo4J is built on Lucene I can do a fuzzy search:

MATCH (n)
WHERE apoc.text.fuzzyMatch(n.firstName, "Isobel")
RETURN n;

Here, I’ve made use of a plugin of stored procedures called apoc, but dammit, still only one result!

What’s going on here? Oh. it seems that in recent releases of Lucene, to make fuzzy matching faster it’s now limited to two characters being different or missing between the two texts being compared.

So Isobel -> Isabella has 3 changes (the ‘o’ to an ‘a’, the extra ‘l’, the extra ‘a’)

OK, can’t use fuzzy matching out of the box. Google search… Ah! phoneticDelta! Let’s have a look through the list of built-in procs and see that:

CALL dbms.procedures()
YIELD name, signature, description as text
WHERE name = 'apoc.text.phoneticDelta'
RETURN *
apoc.text.phoneticDelta(text1, text2)
yield phonetic1, phonetic2, delta

Compute the US_ENGLISH soundex character
difference between two given strings

OK, looks promising. Soundex is a way of quantifying how similar 2 bits of text are. Try this:

MATCH (n)
CALL apoc.text.phoneticDelta(n.firstName, "Isobel")
YIELD delta
WITH n, delta
WHERE delta > 3
RETURN n;

(delta here means how similar they are with a 4 being the max.)

Yay! all the nodes! But how are they related to Isobel?

MATCH (n)
CALL apoc.text.phoneticDelta(n.firstName, "Isobel")
YIELD delta
WITH n, delta
WHERE delta > 3
MATCH p = shortestPath((a {firstName:"Isobel", lastName:"Clark"})-[*1..1000]-(n))
WHERE a.gedcomId <> n.gedcomId
RETURN p;

Nice.

Final thoughts

Family trees lend themselves very well to Graph Databases and Neo4J is a great tool.

There’s LOADS more to do on this. I could write a proper plugin for doing name matching better.

Or do something with the birthPlace location data to analyse how my family moved over time.

Or load in other trees and see if we have any overlap in our relations

Or work this up into tool for the site to display trees.

If you want to learn more about graphs and Neo4J there are lots of googleable resources. If you got stuck, drop me a note and I’ll see what’s wrong. Or send me your gedcom and we can go through it together.

Alex Clark
Search Developer, Findmypast
aclark@findmypast.com
www.findmypast.co.uk