Thursday, December 7, 2017

The Elusive DNA Match

The Ancestry DNA Circles Whitepaper is a great resource discussing how Ancestry goes about creating the DNA Circles.

One chart in particular grabbed my attention.

What I find interesting in the above chart are the statistics on tree completeness based on how far back in generations one goes. Notice that only 50% of trees are complete past 5 generations; that's 2nd great grandparents.

Also, I don't know how many of your AncestryDNA matches have trees, but mine sit at around 40%. Of those, maybe 20-30% are useful (more than 100 ancestors in their tree).

So if 30% of my matches have useful trees, and only 50% of those go back 5 generations or more, that means only about 435 (out of 2900) matches are going to be the most helpful. (2900 * .3 = 870. 870 / 2 = 435).
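The same back-of-the-envelope estimate can be written as a tiny script so you can plug in your own numbers. This is only a sketch: the variable names are mine, and the two rates are the rough estimates from the paragraph above, not measured values.

```python
# Figures from the paragraph above; swap in your own match count and rates.
total_matches = 2900
useful_tree_rate = 0.30   # share of matches with a "useful" tree (rough estimate)
five_gen_rate = 0.50      # share of trees complete to 5 generations (from the chart)

# Round to whole matches so floating-point noise does not leak into the counts
useful = round(total_matches * useful_tree_rate)
most_helpful = round(useful * five_gen_rate)

print(f"useful trees: {useful}, most helpful: {most_helpful}")
```

Changing `useful_tree_rate` to 0.20 shows how quickly the pool of helpful matches shrinks.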

Instead of looking at a DNA match and asking "How are we related?", if you have an established tree, you should be asking, "How does this DNA match corroborate the evidence I have in my tree?"

Sunday, November 12, 2017

Visualizing Ancestry Relationships using Google Fusion Tables (Lessons Learned - Part 1 of ?)

One of the really cool things about Fusion Tables is that the output is not limited to charts. You can also create maps! And once a map is made, you can output what you've created into Google Earth as a KML file. You might also be able to use the information in your family tree software if it supports maps or has a mapping feature.

You can also create a map!

Ah, but there is a hidden lurking monster waiting. You know it is there. You've recognized it long ago with the intention of dealing with it "some day". Well, at least for me anyway. What monster might that be? That monster is called "consistency" at its most basic level. Another way of framing it is "Standardization". Specifically, the area of consistency and standardization I am speaking of is location data.

For years, I had used Ancestry's Family Tree Maker software, upgrading to a new version every 5 years or so. Eventually, I didn't see the need for a desktop solution and have been using the interface online. Just recently, I wanted to try out Legacy Family Tree (Standard - free version), so I installed it and started playing with it.

I knew this before, but one difference that caught my eye right off the bat is that FamilySearch doesn't have the Wildwest free-for-all that Ancestry does. There are pros and cons to each model, but if customers are paying (Ancestry), let them be free and wild. However, if the service is free, then some modicum of control is necessary.

So FamilySearch uses a model similar to WikiTree's in that, before you add someone, you have to ensure you aren't creating a duplicate person. If they find a similar person in the existing database, they make you compare on the spot and either confirm they are the same or, if there is not enough evidence to make that call, create a new person (they at least try to limit the wildwest duplication problem).

Both Ancestry and FamilySearch enforce some standardization with date formats and locations. It was only while playing with Fusion Tables that the monster reared its ugly head: "Lo mighty genealogisssst," it hissed with a forked, serpent-like tongue. (My imaginary monster doesn't sound like John Goodman.) "You hath reaped that which you hath sown."

I looked at the monster puzzled. He hissed again, "Garbage in-garbage out, Dumbass."

And the light appeared over my head as the proverbial lightbulb went on. It is not enough to just use standard locations (as provided by Ancestry or FamilySearch); one must also be consistent.

People can look at two place names and make the connection that they are the same. Machines, however, are extremely literal, doing exactly what you tell them to. So while you and I may see these as the same place:

Asheville, NC USA or 
Asheville, North Carolina, USA, or 
Asheville, NC, United States, or 
Asheville, etc etc etc, 

a machine will read these all as different locations.

Please note that all of these are likely considered "Standard" but if you are not consistent in your usage, you will have a headache later (as I do). 
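To see how literal a machine is, and one way to work around it, here is a minimal sketch that rewrites the Asheville variants above into a single spelling before comparing them. The lookup tables and the `normalize_place` helper are hypothetical and cover only this example; a real tree would need a much longer list.

```python
# Hypothetical lookup tables: extend them to cover the variants in your own tree.
STATE_NAMES = {"NC": "North Carolina", "MA": "Massachusetts"}
COUNTRY_NAMES = {"USA": "United States", "US": "United States"}

def normalize_place(place: str) -> str:
    """Rewrite a place string as 'Town, State, Country' using one spelling."""
    parts = [p.strip() for p in place.split(",") if p.strip()]
    tokens = []
    for part in parts:
        words = part.split()
        # A part like "NC USA" is really two fields with a missing comma
        if len(words) == 2 and words[0] in STATE_NAMES and words[1] in COUNTRY_NAMES:
            tokens.extend(words)
        else:
            tokens.append(part)
    # Expand known abbreviations; anything unknown is left untouched
    tokens = [STATE_NAMES.get(t, COUNTRY_NAMES.get(t, t)) for t in tokens]
    return ", ".join(tokens)

variants = [
    "Asheville, NC USA",
    "Asheville, North Carolina, USA",
    "Asheville, NC, United States",
]

print(len(set(variants)))  # distinct strings as a machine sees them
normalized = {normalize_place(v) for v in variants}
print(normalized)          # distinct places after normalizing
```

Before normalizing, the three strings count as three places; afterwards they collapse into one. A bare "Asheville" with no state or country cannot be repaired this way and still needs a human decision.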

This all came about when I uploaded my pedigree file (with all BMD info) to a Fusion Table in order to create a map. Here is an example of my ancestors' birth locations:

Map from fusion table showing Birth Locations.
I was curious what a Network Graph comparing Birth Locations to Death Locations would look like (is there any convergence or divergence?), and that is when I noticed how differences that seem small to us humans make huge differences to computers, which interpret our input quite literally:
Eastham MA is counted twice because it is different.

So why is this such a big deal? Hopefully you can see the ramifications:
1) Your own data will give you poor quality results if they are non-standard
2) Even if standard, if they are inconsistently used, you will have a more difficult time tying people together
3) People who have the same ancestor in their tree will not necessarily be connected to your ancestor
4) On AncestryDNA, you won't get as many shared ancestor hints as you could

At some point in the future, I am hoping sites will adopt an "authoritative" or "most accurate" model and start locking down people who have been confirmed. That way, the wildwest is slowly tamed with civility. I believe WikiTree has started locking some of the better-known and more established ancestors against public edits. If the goal is to establish a unified single tree (like WikiTree or FamilySearch), then limiting what can be changed (at some point) makes sense. If the data is always changing, say by some novice who copied someone else's tree without doing the work themselves, then it won't ever fully develop. But I digress; this side topic could be its own post.

First, use standardized date and location information. 
Second, be consistent in how you apply the standard ("USA" or "United States"? Pick one and change everything in your tree to the same format, yeah, not fun).
Third, make sure every new person added to your tree is formatted to your consistent standard.
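One way to find out how much cleanup is waiting for you is to group the place strings in your tree by a loose comparison key and list every group with more than one spelling. The sketch below assumes you can export your places as a plain list; the alias table, the helper names, and the sample data are all made up for illustration.

```python
from collections import defaultdict
import re

# Hypothetical alias table; it only needs the abbreviations your tree uses.
ALIASES = {"ma": "massachusetts", "usa": "united states"}

def place_key(place: str) -> str:
    """Build a comparison key that ignores case, punctuation, and known aliases."""
    words = re.findall(r"[a-z]+", place.lower())
    return " ".join(ALIASES.get(w, w) for w in words)

def find_inconsistent(places):
    """Return groups of different spellings that appear to name the same place."""
    groups = defaultdict(set)
    for place in places:
        groups[place_key(place)].add(place)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Made-up sample data standing in for a tree export
tree_places = [
    "Eastham, MA, USA",
    "Eastham, Massachusetts, USA",
    "Eastham, MA, USA",
    "Asheville, North Carolina, USA",
]

inconsistent = find_inconsistent(tree_places)
for group in inconsistent:
    print(" | ".join(group))
```

Each printed line is a set of spellings to merge into whichever one you picked as your standard.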

Tuesday, November 7, 2017

Visualizing Ancestry Relationships using Google Fusion Tables (Part 2 - Import the sheet to Google Fusion Tables)

Prepare the spreadsheet (Excel or Google Sheets) following the instructions in Part 1.

Now that your sheet is created, head on over to the Fusion Table site and upload your sheet.

Find your sheet and import it to Fusion Tables, and press “Next”:

Ensure the data is what you want to import, and press “Next”:

Edit the table information how you want with a unique name, then press “finish”:

Once the table is loaded, click the red “+” and select “Add Chart”:

Explore the results! Press “Done” when finished configuring the Network plot chart:

Explore and build upon this idea. This was the most basic way to create a network diagram. Granted, creating the data was a pain! Play around and experiment. Add a third column with distinguishing information like, say, your shared cM with your match. Or maybe which chromosome that match shares with you. Who knows! Leave a comment on anything cool you discover or ideas for improving this.

Visualizing Ancestry Relationships using Google Fusion Tables (Part 1 - Creating the sheet)

First, I want to thank Randy W Whited for planting the seed to use Fusion Tables in the first place. Thank you!

This blog post describes how to create a simple but insightful network plot like the following (some names are blurred to protect the privacy of living people--no ancestors were harmed in the making of this chart):

The difficulty in creating this chart is not the chart itself; it is arriving at the required data to build it.

The hard part:

Some prerequisites:
1-you have an existing tree and can get a simple listing of your direct ancestors (your pedigree).
2-you have installed and already run the DNAGedcom Client and have an "A_File" that is a listing of all your DNA matches' ancestors.
3-the format of your pedigree (Surname, Given name etc) matches the format in the A_File.
4-using Excel (or Access) you can compare 1 above with 2 and come up with a listing of all the ancestors of all your matches who match your direct ancestors.
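If you would rather script the comparison in step 4 than do it in Excel or Access, the idea can be sketched as below. The field names and rows are invented for illustration and are not the real A_File layout, so check the column headers in your own files before adapting it.

```python
# Hypothetical, simplified rows; the real DNAGedcom A_File columns may differ.
pedigree = [
    {"surname": "Doe", "given": "John", "birth_year": "1820"},
    {"surname": "Roe", "given": "Mary", "birth_year": "1825"},
]
match_ancestors = [
    {"match": "cousin_a", "surname": "Doe", "given": "John", "birth_year": "1820"},
    {"match": "cousin_b", "surname": "Poe", "given": "Ann", "birth_year": "1790"},
    {"match": "cousin_c", "surname": "doe", "given": "John ", "birth_year": "1820"},
]

def person_key(row):
    """Comparison key: trimmed, lower-cased name plus birth year."""
    return (row["surname"].strip().lower(), row["given"].strip().lower(), row["birth_year"])

# Build the set of my direct ancestors once, then test every match ancestor against it
my_ancestors = {person_key(p) for p in pedigree}
shared = [row for row in match_ancestors if person_key(row) in my_ancestors]
shared_matches = [row["match"] for row in shared]

print(shared_matches)
```

Trimming and lower-casing in `person_key` is what lets `cousin_c` count as a hit despite the stray space and capitalization. That is exactly the kind of formatting mismatch prerequisite 3 warns about.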

If you have done the prerequisite work above, you probably already know where I am going with this.

Before diving in and setting up the required table, it might be good to take a moment to understand what a network plot is trying to do. In this case, I am only trying to do the most basic of things: compare items in column 1 to items in column 2. That is it. Very simple.

From that simple idea can come complex relationships. For this example, I want to visualize how I am related to those Ancestry DNA matches of mine who share common ancestors.

In a network diagram, the most basic explanation is that it plots a line between the value in column 1 and the value next to it in column 2. That is it. If I created a simple spreadsheet that looked like this:

The resulting network diagram would look like this:

So far, so good. Easy, right? So if I repeat the same in col 1 and something different in col 2, I get the following result:

So you can see that from a simple idea can come some complex relationships:

That is the foundation for how Network diagrams work, whether in NodeXL, Gephi, or Google Fusion tables.
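The two-column idea is small enough to sketch in a few lines of code. The rows below are placeholder values, not the ones from my screenshots, and `build_network` is just an illustrative helper, not how Fusion Tables does it internally.

```python
from collections import defaultdict

# Each row is one line in the diagram: (column 1 value, column 2 value)
rows = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")]

def build_network(rows):
    """Return an undirected adjacency map: node -> set of connected nodes."""
    graph = defaultdict(set)
    for left, right in rows:
        graph[left].add(right)
        graph[right].add(left)
    return dict(graph)

network = build_network(rows)

# The number of lines touching each node; hubs show up with high counts
degrees = {node: len(links) for node, links in network.items()}
for node in sorted(degrees):
    print(node, degrees[node], sorted(network[node]))
```

Repeating the same value in column 1 with different values in column 2 is what turns "A" into a hub. Reusing a value from column 2 in a later row is what joins separate branches together.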

Setting up a spreadsheet with ancestor data is basically the same as right above. In the first column, add your ancestors list and in column 2, add those DNA matches who have them in their tree. Pay attention to the ancestor details because small differences will give different results (the names circled in red, even though they are the same person in your mind, will be treated as two people):

Once you have that part done, you need to connect it to you. I found the easiest way was to simply copy the ancestors in column 1 and paste them under the last entry in column 2.

While they are still highlighted, go ahead and remove any duplicates, because this is where we connect your ancestors to you, and even if they occupy more than one Ahnentafel number, you only want them in this list one time while connecting them to you. (Under the Data Menu, you can find the Remove Duplicates Button. Don't expand the selection if asked):

The last thing to do now is simply add your name into the blank cells of column 1 next to your ancestors:

That's it! You now have a file you can upload to Google's Fusion Tables and have some fun.
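For anyone who prefers a script to copy and paste, the steps above (pair ancestors with matches, copy the ancestors below in column 2, remove duplicates, fill in your name) could be sketched like this. The names and the `build_sheet` helper are placeholders of my own, and the ancestor labels are invented.

```python
import csv
import io

ME = "Me"  # placeholder for your own name

# Hypothetical (ancestor, DNA match) pairs: column 1 and column 2 of the sheet
pairs = [
    ("John Doe b.1820", "cousin_a"),
    ("John Doe b.1820", "cousin_c"),
    ("Mary Roe b.1825", "cousin_a"),
]

def build_sheet(pairs, me):
    """Append one (me, ancestor) row per unique ancestor to the pair list."""
    # dict.fromkeys removes duplicates while keeping the original order
    unique_ancestors = list(dict.fromkeys(ancestor for ancestor, _ in pairs))
    return list(pairs) + [(me, ancestor) for ancestor in unique_ancestors]

sheet = build_sheet(pairs, ME)

# Write CSV text in memory; open("network.csv", "w", newline="") would save a file instead
buffer = io.StringIO()
csv.writer(buffer).writerows(sheet)
csv_text = buffer.getvalue()
print(csv_text)
```

The de-duplication step matters for the same reason as in the spreadsheet: an ancestor who occupies several Ahnentafel numbers should still connect to you only once.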

I will describe the process of uploading the file and creating the chart in Fusion Tables in the next post. (see part 2).

Monday, November 6, 2017

Deciding When to Chase down how you may be related to a DNA Match

I created the chart below as a reminder to myself of the level of effort and/or difficulty in trying to track down a match based on your shared cM.

I typically don't waste my time on anything below 15 cM, and try to stay around 40 cM or higher. I use 40 cM as a guide because I can easily remember that 40 cM is roughly the range of a 4th cousin. So 20 cM is around 5th cousin, 80 cM is around 3rd cousin, and so on.

Anyway, as cM halves each generation going back, the number of possible ancestors the DNA came from doubles. Using 30-year generations, the ability to distinguish specific Ancestral DNA (IBD) from the growing influence of pedigree collapse (IBS) becomes difficult after six generations, and near impossible after 8 generations.
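The halving and doubling is easy to tabulate. The sketch below is my rule of thumb put into code, not a genetic model: the helper names are made up, the 30-year generation length is the assumption stated above, and real shared cM values vary widely around any such estimate.

```python
YEARS_PER_GENERATION = 30  # assumption stated in the text above

def ancestors_at(generation: int) -> int:
    """Number of ancestor slots in a given generation (parents = generation 1)."""
    return 2 ** generation

def rough_shared_cm(start_cm: float, steps: int) -> float:
    """Halve a starting cM value once per step (rule of thumb only)."""
    return start_cm / (2 ** steps)

for gen in range(1, 12):
    years_back = gen * YEARS_PER_GENERATION
    print(f"generation {gen:2d}: {ancestors_at(gen):5d} ancestors, about {years_back} years back")

# Stepping from roughly 3rd cousin (80 cM) toward 4th and 5th cousin
print(rough_shared_cm(80, 1), rough_shared_cm(80, 2))
```

By generation 11 there are 2048 ancestor slots to document, which is why a tiny segment is so hard to pin on any one of them.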

So when someone asks why I don't use cM values below 15 cM (or small segments below 7 cM), I typically respond with, to what purpose? Someone and I may have a 10th GGF in common. We also may share a 4 cM segment. The two facts can't ever be tied together, even if we find someone else who shares the same ancestor and same segment.

DNA should support your genealogy, not the other way around. DNA is simply another record; a "data point" like a birthday or a residence.

Obviously adoptees have a different story. In everything we do, we are going from the known to the unknown. For adoptees, the only thing known might be their DNA matches. But once they establish some genealogy, the DNA should become supportive, not definitive.

This chart is solely for my own use as a reminder of the futility of chasing small segments. Can you really discern which of your 2048 ancestors gave you that 3 cM segment? Do you have your tree documented all the way out 11 generations? No really, on EVERY one of those 2048 branches? Only when all three people in a triangulated match have all 2048 branches well documented can you even consider where a 3 cM segment *might* have come from. And even then, are you even reasonably sure?

So STOP chasing small segments.

Sunday, August 20, 2017

Visualizing Your Family Tree Part 1

The below graphic is courtesy of B.F. Lyon Visualizations at

One neat option is displaying the flags of where each person was born to get an idea of your heritage.

Try it out for yourself.

One thing you may notice is lines that cross other lines. Yup! Pedigree collapse!

Wednesday, August 9, 2017

Playing with DNA Information Part 4

Well, all the preliminary work and background stuff is out of the way.

***From here on out, I will assume all csv files and any other files you've created have been imported into their own tables in MS Access.

We can finally start playing around with the information. What to do?

How about we try to find out which of my Ancestry matches can be identified on Gedmatch? How would you go about finding out?

This is what I did:

In Access:
I first wanted to isolate the "A" kit numbers, so I ran a simple query on my Gedmatch Match list. All fields were added to the query, and I put  Like "A*"  in the Kit Number field, and  >=7 in the Shared cM field.  This produced a list of all my Ancestry matches at Gedmatch. You can save this query with a unique name and we'll use it in a minute as a source for our next query.
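For readers without Access, the same filter can be sketched with Python's built-in `sqlite3` module. The table name, column names, and sample rows are placeholders I made up, not the real Gedmatch export layout. Note that standard SQL uses `%` where Access uses `*` as the wildcard.

```python
import sqlite3

# In-memory database with a made-up stand-in for the Gedmatch match list
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gedmatch_matches (kit_number TEXT, full_name TEXT, shared_cm REAL)"
)
conn.executemany(
    "INSERT INTO gedmatch_matches VALUES (?, ?, ?)",
    [
        ("A000001", "Pat Example", 25.4),      # Ancestry kit, large enough match
        ("T000002", "Sam Sample", 40.0),       # not an "A" kit
        ("A000003", "Lee Placeholder", 6.2),   # "A" kit but under 7 cM
    ],
)

# Equivalent of: Like "A*" on Kit Number and >=7 on Shared cM
ancestry_kits = conn.execute(
    "SELECT kit_number, full_name, shared_cm FROM gedmatch_matches "
    "WHERE kit_number LIKE 'A%' AND shared_cm >= 7 "
    "ORDER BY shared_cm DESC"
).fetchall()

print(ancestry_kits)
```

Only the first sample row survives both conditions, which mirrors how the query narrows the full match list down to the Ancestry kits worth checking.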

With the new query created above saved, let's see what we get. My Ancestry Match file has over 21K entries. My Gedmatch Match file has over 12K entries. The query above (>=7cM) reduced this to just over 1000 entries. Now we need enough identifying information to be able to conclude that the person in the Ancestry list is the same as the person in the Gedmatch list. I used the following fields, and linked the Full Name from the Gedmatch Query with the Admin name in the Ancestry table:

Which produced some interesting results! I was able to "verify" (I use the term loosely) the accounts for over 50 people. Here is an example of the results. Many were easy to determine since people tend to use the same username everywhere. And when it is not obvious, don't keep it. We don't need to be creating false positives!

Save the results in a new table in the database. I specifically kept the KitID from Gedmatch and the MatchID from Ancestry in the above query. This has effectively become a join table where I can now link Gedmatch Chromosome browser information to their Ancestry Tree (assuming there is one).
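The linking step is an inner join on the name fields, with the result saved as its own table. Here is a minimal sketch using `sqlite3` again; every table name, column name, and row is a hypothetical placeholder rather than the actual Gedmatch or Ancestry layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gedmatch_a_kits (kit_id TEXT, full_name TEXT, shared_cm REAL);
CREATE TABLE ancestry_matches (match_id TEXT, admin_name TEXT, has_tree INTEGER);

INSERT INTO gedmatch_a_kits VALUES ('A000001', 'pat_example', 25.4);
INSERT INTO gedmatch_a_kits VALUES ('A000004', 'no_counterpart', 12.0);
INSERT INTO ancestry_matches VALUES ('anc-111', 'pat_example', 1);
INSERT INTO ancestry_matches VALUES ('anc-222', 'someone_else', 0);

-- Keep only the two IDs so the result works as a join table
CREATE TABLE kit_to_match AS
    SELECT g.kit_id, a.match_id
    FROM gedmatch_a_kits AS g
    INNER JOIN ancestry_matches AS a ON g.full_name = a.admin_name;
""")

link_rows = conn.execute("SELECT kit_id, match_id FROM kit_to_match").fetchall()
print(link_rows)
```

An exact name match is a weak identifier, so the output is a list of candidates to review by eye, not proof that two accounts belong to the same person.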

We'll find out in the next post!