Thursday, December 7, 2017

The Elusive DNA Match

The Ancestry DNA Circles Whitepaper is a great resource discussing how Ancestry goes about creating the DNA Circles.

One chart in particular grabbed my attention.


What I find interesting in the above chart are the statistics on tree completeness at each generation back. Notice that only 50% of trees are complete past 5 generations (that's 2nd great-grandparents).

Also, I don't know how many of your AncestryDNA matches have trees, but mine sit at around 40%. Of those, maybe 20-30% are useful (more than 100 ancestors in their tree).

So if 30% of my matches have useful trees, and only 50% of those go back 5 generations or more, that means only about 435 (out of 2900) matches are going to be the most helpful. (2900 * 0.3 = 870; 870 / 2 = 435.)
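The same back-of-the-envelope estimate can be written as a tiny script, which makes it easy to plug in your own numbers. This is only a sketch of the arithmetic above; the variable names are mine.

```python
# Rough funnel estimate using the figures quoted in this post.
total_matches = 2900      # size of my AncestryDNA match list
useful_tree_share = 0.30  # share of matches assumed to have a useful tree
deep_tree_share = 0.50    # share of those trees complete to 5+ generations

useful = round(total_matches * useful_tree_share)
most_helpful = round(useful * deep_tree_share)

print(useful)        # 870
print(most_helpful)  # 435
```

If you read the 30% as applying only to the 40% of matches that have a tree at all, set `useful_tree_share` to `0.40 * 0.30` and the estimate shrinks accordingly.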

Instead of looking at a DNA match and asking "How are we related?", if you have an established tree you should be asking, "How does this DNA match corroborate the evidence I have in my tree?"

Sunday, November 12, 2017

Visualizing Ancestry Relationships using Google Fusion Tables (Lessons Learned - Part 1 of ?)

One of the really cool things about Fusion Tables is that the output is not limited to charts. You can also create maps! And once a map is made, you can export what you've created to Google Earth as a KML file. You might also be able to use the information in your family tree software if it has a mapping feature.


Ah, but there is a hidden lurking monster waiting. You know it is there. You've recognized it long ago with the intention of dealing with it "some day". Well, at least for me anyway. What monster might that be? That monster is called "consistency" at its most basic level. Another way of framing it is "Standardization". Specifically, the area of consistency and standardization I am speaking of is location data.

For years, I had used Ancestry's Family Tree Maker software, upgrading to a new version every 5 years or so. Eventually, I didn't see the need for a desktop solution and have been using the Ancestry.com interface online. Just recently, I wanted to try out Legacy Family Tree (Standard - free version), and installed and started playing with it. 

One difference I already knew about, but was quickly reminded of, is that FamilySearch.org doesn't have the Wild West free-for-all that Ancestry does. There are pros and cons to each model, but if customers are paying (Ancestry), let them be free and wild. However, if the service is free, then some modicum of control is necessary.

So FamilySearch uses a model similar to WikiTree's: before you add someone, you have to make sure you aren't creating a duplicate person. If it finds a similar person in the existing database, it makes you compare the two on the spot and either confirm they are the same or, if there is not enough evidence to make that call, create a new person (they at least try to limit the Wild West duplication problem).

Both Ancestry and FamilySearch enforce some standardization of date formats and locations. It was only while playing with Fusion Tables that the monster reared its ugly head: "Lo, mighty genealogisssst," it hissed with a forked, serpent-like tongue. (My imaginary monster doesn't sound like John Goodman.) "You hath reaped that which you hath sown."

I looked at the monster, puzzled. He hissed again, "Garbage in, garbage out, Dumbass."

And the proverbial lightbulb went on over my head. It is not enough to just use standard locations (as provided by Ancestry or FamilySearch); one must also be consistent.

People can look at two place names and make the connection that they are the same. Machines, however, are extremely literal, doing exactly what you tell them to. So while you and I may see these as the same place:

Asheville, NC USA or 
Asheville, North Carolina, USA, or 
Asheville, NC, United States, or 
Asheville, etc etc etc, 

a machine will read these all as different locations.

Please note that all of these are likely considered "Standard" but if you are not consistent in your usage, you will have a headache later (as I do). 
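Here is a minimal sketch of what "literal" means in practice, using the Asheville variants above. The grouping on the town name is just one crude way I might flag candidates for review; it is not how Fusion Tables works internally.

```python
from collections import Counter

places = [
    "Asheville, NC USA",
    "Asheville, North Carolina, USA",
    "Asheville, NC, United States",
    "Asheville",
]

# A literal comparison sees four different strings, so four "locations".
counts = Counter(places)
print(len(counts))  # 4

# Grouping on the first comma-separated part (the town) surfaces
# spellings that a human should review and probably merge.
by_town = {}
for place in places:
    town = place.split(",")[0].strip().lower()
    by_town.setdefault(town, set()).add(place)

suspects = {town: names for town, names in by_town.items() if len(names) > 1}
print(sorted(suspects))  # ['asheville']
```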

This all came about when I uploaded my pedigree file (with all BMD info) to a fusion table in order to create a map. Here is an example of my ancestors Birth locations:

Map from fusion table showing Birth Locations.
I was curious what a Network Graph comparing Birth Locations to Death Locations would look like (is there any convergence or divergence?), and that is when I noticed how differences that seem small to us humans make huge differences to computers, which interpret our input quite literally:
Eastham MA is counted twice because it is different.

So why is this such a big deal? Hopefully you can see the ramifications:
1) Your own data will give you poor-quality results if it is non-standard
2) Even standard place names, if used inconsistently, make it harder to tie people together
3) People who have the same ancestor in their tree will not necessarily be connected to your ancestor
4) On AncestryDNA, your shared ancestor hints won't be as numerous as they could be

At some point in the future, I am hoping sites will adopt an "authoritative" or "most accurate" model and start locking down people who have been confirmed. That way, the Wild West is slowly tamed with civility. I believe WikiTree has started locking some of the better-known, established ancestors against public edits. If the goal is to establish a single unified tree (like WikiTree or FamilySearch), then limiting what can be changed (at some point) makes sense. If the data is always changing, say by some novice who copied someone else's tree without doing the work themselves, then it won't ever fully develop. But I digress; this side topic could be its own post.

QC THY DATA! 
First, use standardized date and location information. 
Second, be consistent in how you apply the standard ("USA" or "United States"? Pick one and change everything in your tree to the same format, yeah, not fun).
Third, make sure every new person added to your tree is formatted to your consistent standard.
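As a rough illustration of the second step, here is a small sketch that rewrites place names into one chosen format. The lookup tables and the `standardize_place` function are hypothetical and only cover the examples in this post; you would extend them with whatever variants turn up in your own tree.

```python
# Hypothetical lookup tables: extend them with the variants in your own tree.
STATE_NAMES = {"NC": "North Carolina", "MA": "Massachusetts"}
COUNTRY_NAMES = {"USA": "USA", "United States": "USA", "US": "USA"}

def standardize_place(raw):
    """Return the place as 'Town, State, Country' with one consistent spelling."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    # Handle a trailing "NC USA" style part that is missing its comma.
    if parts and parts[-1] not in COUNTRY_NAMES:
        head, _, tail = parts[-1].rpartition(" ")
        if head and tail in COUNTRY_NAMES:
            parts[-1:] = [head, tail]
    cleaned = []
    for part in parts:
        part = STATE_NAMES.get(part, part)
        part = COUNTRY_NAMES.get(part, part)
        cleaned.append(part)
    return ", ".join(cleaned)

for raw in ("Asheville, NC USA", "Asheville, NC, United States"):
    print(standardize_place(raw))  # Asheville, North Carolina, USA
```

A bare "Asheville" comes back unchanged, since the script has no way of knowing which Asheville you meant. Those still need a human.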

Tuesday, November 7, 2017

Visualizing Ancestry Relationships using Google Fusion Tables (Part 2 - Import the sheet to Google Fusion Tables)


Prepare the spreadsheet (Excel or Google Sheets) following the instructions in Part 1.

Now that your sheet is created, head on over to the Fusion Table site and upload your sheet.


Find your sheet, import it into Fusion Tables, and press “Next”:


Make sure the data is what you want to import, and press “Next”:




Edit the table information as you like, give the table a unique name, then press “Finish”:



Once the table is loaded, click the red “+” and select “Add Chart”:




Explore the results! Press “Done” when finished configuring the Network plot chart:




Explore and build upon this idea. This was the most basic way to create a network diagram. Granted, creating the data was a pain! Play around and experiment. Add a third column with distinguishing information, say, the cM you share with your match. Or maybe which chromosome that match shares with you. Who knows! Leave a comment on anything cool you discover or ideas for improving this.

Visualizing Ancestry Relationships using Google Fusion Tables (Part 1 - Creating the sheet)

First, I want to thank Randy W Whited for planting the seed to use Fusion Tables in the first place. Thank you!

This blog post describes how to create a simple but insightful network plot like the following (some names are blurred to protect the privacy of living people--no ancestors were harmed in the making of this chart):


The difficulty in creating this chart is not the chart itself; it is arriving at the data required to build it.

The hard part:

Some prerequisites:
1-you have an existing tree and can get a simple listing of your direct ancestors (your pedigree).
2-you have installed, and already run the DNAGedcom Client and have an "A_File" that is a listing of all your DNA matches' ancestors.
3-the format of your pedigree (Surname, Given name etc) matches the format in the A_File.
4-using Excel (or Access) you can compare 1 above with 2 and come up with a listing of all the ancestors of all your matches who match your direct ancestors
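If you would rather script step 4 than do it in Excel or Access, a minimal sketch might look like the following. The two in-memory "files" and their column names (`surname`, `given`, `match_name`) are stand-ins I made up; check the headers in your own pedigree export and A_File before adapting it.

```python
import csv
import io

# Hypothetical stand-ins for the real files; the column names are assumptions.
pedigree_csv = io.StringIO(
    "surname,given\n"
    "Doe,John\n"
    "Roe,Jane\n"
)
a_file_csv = io.StringIO(
    "match_name,surname,given\n"
    "Match One,Doe,John\n"
    "Match One,Smith,Adam\n"
    "Match Two,roe,Jane \n"
)

def key(row):
    # Compare on a trimmed, lower-cased (surname, given) pair.
    return (row["surname"].strip().lower(), row["given"].strip().lower())

my_ancestors = {key(row) for row in csv.DictReader(pedigree_csv)}

# Keep only the rows from the matches' trees that name one of my direct ancestors.
shared = [
    (row["match_name"], row["surname"].strip(), row["given"].strip())
    for row in csv.DictReader(a_file_csv)
    if key(row) in my_ancestors
]
print(shared)
```

Trimming and lower-casing only papers over the simplest formatting differences. It does nothing for "Wm." versus "William", which is why prerequisite 3 matters.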

If you have done the prerequisite work above, you probably already know where I am going with this.

Before diving in and setting up the required table, it might be good to take a moment to understand what a network plot is trying to do. In this case, I am only trying to do the most basic of things: compare items in column 1 to items in column 2. That is it. Very simple.

From that simple idea can come complex relationships. For this example, I want to visualize how I am related to those Ancestry DNA matches of mine who share common ancestors.

In a network diagram, the most basic explanation is that it plots a line between the value in column 1 with the value next to it in column 2. That is it. If I created a simple spreadsheet that looked like this:


The resulting network diagram would look like this:

So far, so good. Easy, right? So if I repeat the same in col 1 and something different in col 2, I get the following result:


So you can see, that with a simple idea can come some complex relationships:

That is the foundation for how Network diagrams work, whether in NodeXL, Gephi, or Google Fusion tables.
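The same idea in a few lines of code: each row of the two-column sheet is one line (edge) between two names (nodes). This is a plain-Python sketch of the concept, not how any of those tools work internally, and the single-letter names are placeholders.

```python
from collections import defaultdict

# Each row is one line in the plot: (column 1 value, column 2 value).
rows = [
    ("A", "B"),
    ("A", "C"),
    ("A", "D"),
    ("B", "D"),
]

# Record, for every name, which other names it is joined to.
neighbors = defaultdict(set)
for left, right in rows:
    neighbors[left].add(right)
    neighbors[right].add(left)

for node in sorted(neighbors):
    print(node, "->", sorted(neighbors[node]))
```

Repeating "A" in column 1 is what turns it into a hub, and a name that shows up in both columns (like "B") is what starts to tie the hubs together.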

Setting up a spreadsheet with ancestor data is basically the same as the example above. In the first column, add your list of ancestors, and in column 2, add the DNA matches who have them in their tree. Pay attention to the ancestor details, because small differences will give different results (the names circled in red, even though they are the same person in your mind, will be treated as two people):


Once you have that part done, you need to connect it to you. I found the easiest way was to simply copy the ancestors in column 1, and paste it under the last entry in column 2.


While they are still highlighted, go ahead and remove any duplicates, because this is where we connect your ancestors to you, and even if they occupy more than one Ahnentafel number, you only want them in this list one time while connecting them to you. (Under the Data Menu, you can find the Remove Duplicates Button. Don't expand the selection if asked):


The last thing to do now is simply add your name into the blank cells of column 1 next to your ancestors:

That's it! You now have a file you can upload to Google's Fusion Tables and have some fun.
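For anyone who prefers a script to copy-and-paste, the steps above could be sketched like this. The names are placeholders and `ME` is whatever you want your own node to be called.

```python
ME = "My Name"  # placeholder for your own name

# Column 1 = ancestor, column 2 = DNA match who has that ancestor in their tree.
ancestor_to_match = [
    ("John Doe b. 1800", "Match One"),
    ("John Doe b. 1800", "Match Two"),
    ("Jane Roe b. 1805", "Match Two"),
]

# Unique ancestors in first-seen order (the "Remove Duplicates" step).
unique_ancestors = list(dict.fromkeys(a for a, _ in ancestor_to_match))

# Rows that tie each ancestor back to me: my name in column 1, ancestor in column 2.
me_to_ancestor = [(ME, ancestor) for ancestor in unique_ancestors]

sheet = ancestor_to_match + me_to_ancestor
for col1, col2 in sheet:
    print(f"{col1},{col2}")
```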

I will describe the process of uploading the file and creating the chart in Fusion Tables in the next post. (see part 2).

Monday, November 6, 2017

Deciding When to Chase down how you may be related to a DNA Match

I created the chart below as a reminder to myself of the level of effort and/or difficulty involved in trying to track down a match based on your shared cM.

I typically don't waste my time on anything below 15 cM, and try to stay around 40 cM or higher. I use 40 cM as a guide because I can easily remember that 40 cM is roughly the range of a 4th cousin. So 20 cM is around a 5th cousin, 80 cM is around a 3rd cousin, and so on.

Anyway, as the shared cM halves each generation going back, the number of possible ancestors the DNA came from doubles. Using 30-year generations, the ability to distinguish specific ancestral DNA (IBD) from the growing influence of pedigree collapse (IBS) becomes difficult after six generations, and nearly impossible after eight generations.
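To put numbers on the halving and doubling, here is a small sketch. The 40 cM anchor for a 4th cousin is only my rule of thumb from above, not a measured average, and the function names are mine.

```python
YEARS_PER_GENERATION = 30  # the generation length used in this post

def ancestors_at(generation):
    """Number of ancestor slots at a given generation back (parents = 1)."""
    return 2 ** generation

def rough_cm_for_cousin(degree):
    """Rule of thumb from this post: 4th cousin ~ 40 cM, halving per degree."""
    return 40 * 2 ** (4 - degree)

# generation, ancestor slots, approximate years back
for generation in (4, 6, 8, 11):
    print(generation, ancestors_at(generation), generation * YEARS_PER_GENERATION)

# cousin degree, rough shared cM
for degree in (3, 4, 5):
    print(degree, rough_cm_for_cousin(degree))
```

At 11 generations back that is 2048 ancestor slots and roughly 330 years.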

So when someone asks why I don't use cM values below 15 cM (or small segments below 7 cM), I typically respond with: to what purpose? Someone and I may have a 10th great-grandfather in common. We may also share a 4 cM segment. The two facts can't ever be tied together, even if we find someone else who shares the same ancestor and the same segment.

DNA should support your genealogy, not the other way around. DNA is simply another record; a "data point" like a birthday or a residence.

Obviously adoptees have a different story. In everything we do, we are going from the known to the unknown. For adoptees, the only thing known might be their DNA matches. But once they establish some genealogy, the DNA should become supportive, not definitive.

This chart is solely for my own use as a reminder of the futility of chasing small segments. Can you really discern which of your 2048 ancestors gave you that 3 cM segment? Do you have your tree documented all the way out 11 generations? No really, on EVERY one of those 2048 branches? Only when all three people in a triangulated match have all 2048 branches well documented can you even consider where a 3 cM segment *might* have come from. And even then, are you even reasonably sure?

So STOP chasing small segments.
(click for larger version)

Sunday, August 20, 2017

Visualizing Your Family Tree Part 1

The below graphic is courtesy of B.F. Lyon Visualizations at https://learnforeverlearn.com/ancestors/.

One neat option is displaying the flags of where each person was born to get an idea of your heritage.

Try it out for yourself. [click image for a clean version.]


One thing you may notice is lines that cross other lines. Yup! Pedigree collapse!

Wednesday, August 9, 2017

Playing with DNA Information Part 4

Well, all the preliminary work and background stuff is out of the way.

***From here on out, I will assume all csv files and any other files you've created have been imported into their own tables in MS Access.


We can finally start playing around with the information. What to do?

How about we try to find out which of my Ancestry matches can be identified on Gedmatch? How would you go about finding out?

This is what I did:

In Access:
I first wanted to isolate the "A" kit numbers, so I ran a simple query on my Gedmatch Match list. All fields were added to the query, and I put  Like "A*"  in the Kit Number field, and  >=7 in the Shared cM field.  This produced a list of all my Ancestry matches at Gedmatch. You can save this query with a unique name and we'll use it in a minute as a source for our next query.
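For readers without Access, the same filter can be sketched in a few lines of Python. The sample rows and the field names (`kit`, `name`, `shared_cm`) are made up for illustration; use whatever your Gedmatch export actually calls them.

```python
# Hypothetical rows standing in for the Gedmatch match list.
gedmatch_matches = [
    {"kit": "A123456", "name": "J Doe", "shared_cm": 35.2},
    {"kit": "T654321", "name": "R Roe", "shared_cm": 50.0},
    {"kit": "A111111", "name": "A Smith", "shared_cm": 5.1},
]

# Same two criteria as the Access query: kit number starts with "A"
# and shared cM is at least 7.
ancestry_kits = [
    row for row in gedmatch_matches
    if row["kit"].startswith("A") and row["shared_cm"] >= 7
]

for row in ancestry_kits:
    print(row["kit"], row["name"], row["shared_cm"])
```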

With the new query created above saved, let's see what we get. My Ancestry Match file has over 21K entries. My Gedmatch Match file has over 12K entries. The query above (>=7 cM) reduced this to just over 1000 entries. Now we need enough identifying information to be able to conclude that the person in the Ancestry list is the same as the person in the Gedmatch list. I used the following fields, and linked the Full Name from the Gedmatch Query with the Admin name in the Ancestry table:

This produced some interesting results! I was able to "verify" (I use the term loosely) the accounts for over 50 people. Here is an example of the results. Many were easy to determine since people tend to use the same username everywhere. And when a match is not obvious, don't keep it. We don't need to be creating false positives!

Save the results in a new table in the database. I specifically kept the KitID from Gedmatch and the MatchID from Ancestry in the above query. This has effectively become a join table where I can now link Gedmatch Chromosome browser information to their Ancestry Tree (assuming there is one).
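Here is a rough sketch of that join outside of Access. All of the rows and field names are hypothetical. Matching on a trimmed, lower-cased name and dropping anything ambiguous is my own cautious choice here, in the spirit of not creating false positives.

```python
# Hypothetical rows: the filtered Gedmatch query and the Ancestry match table.
gedmatch_ancestry_kits = [
    {"kit_id": "A123456", "full_name": "J Doe"},
    {"kit_id": "A222222", "full_name": "R Roe"},
]
ancestry_matches = [
    {"match_id": "m-001", "admin_name": "j doe"},
    {"match_id": "m-002", "admin_name": "Someone Else"},
]

# Index the Ancestry rows by normalized admin name.
by_admin = {}
for row in ancestry_matches:
    by_admin.setdefault(row["admin_name"].strip().lower(), []).append(row)

# Build the join table: (Gedmatch KitID, Ancestry MatchID).
join_table = []
for g in gedmatch_ancestry_kits:
    candidates = by_admin.get(g["full_name"].strip().lower(), [])
    # Keep only unambiguous hits; anything else needs a manual look.
    if len(candidates) == 1:
        join_table.append((g["kit_id"], candidates[0]["match_id"]))

print(join_table)
```

A matching name is still only a hint that two accounts belong to the same person, so the results deserve the same eyeball check described above.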

We'll find out in the next post!

Sunday, August 6, 2017

Playing with DNA Information Part 3

How many of your matches on Ancestry have trees? Is there a way to tell?  Of course! It is one of the files you will want to create for yourself anyway as a resource and reference.

Let's look at the "a" file created by the DNAGedcom client. In my case it is called "a_Clark_Lind.csv".

What is it telling us? Yes, it is a listing of the people in our matches' trees. But if you think about it, isn't it also a listing of matches who have trees? No tree, then they wouldn't be in this file!

So here is how you can create a list of just names.
-Create a new (blank) Excel file.
-Open your "a" file in Excel (a_Your_Name.csv). BE PATIENT, it can take a while to load.

Once open,
-click on columns C and D, highlighting them both.
-Right-click and select Copy (or ctrl+c)
-Go to the new Excel file and select cell A1. Right-click and select Paste (or ctrl+v).
-With both columns still highlighted, go to the Data Menu, and select Remove Duplicates.
-Save the file. Put it with the other files since you may as well import this into MS Access later anyway.

Now you know which of your matches on Ancestry have a tree. If you compare this file to your "m" file, you can also see who DOES NOT have a tree.
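The Excel steps above, and the comparison with the "m" file, could be sketched in Python roughly as follows. I am assuming that columns C and D of the "a" file hold an ID and a name for each match, and that the "m" file carries the same two values. The headers and rows below are invented, so check your own files first.

```python
import csv
import io

# Hypothetical stand-in for the "a" file; columns C and D (index 2 and 3)
# are assumed to identify the match.
a_file = io.StringIO(
    "col_a,col_b,match_id,match_name,ancestor\n"
    "x,x,m-001,Match One,John Doe\n"
    "x,x,m-001,Match One,Jane Roe\n"
    "x,x,m-002,Match Two,John Doe\n"
)
# Hypothetical stand-in for the "m" file (one row per match).
m_file = io.StringIO(
    "match_id,match_name\n"
    "m-001,Match One\n"
    "m-002,Match Two\n"
    "m-003,Match Three\n"
)

# Copy columns C and D and remove duplicates (a set does both at once).
a_rows = list(csv.reader(a_file))[1:]  # skip the header row
with_tree = {(row[2], row[3]) for row in a_rows}

# Everyone in the "m" file who never appears in the "a" file has no tree.
all_matches = {(r["match_id"], r["match_name"]) for r in csv.DictReader(m_file)}
without_tree = all_matches - with_tree

print(sorted(with_tree))
print(sorted(without_tree))
```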

--------------------------

Another file you will want to have is a listing of your ancestors. Not your complete tree, just your direct ancestors. This will help out later when you start comparing matches. At this stage, we are not looking for matches by casting a wide net, that should already have happened. Now we are trying to see where those people match you in your tree.

There is no really easy way to create such a listing without using some genealogy software. One of the free programs I use for such things is Gramps (Gramps-project.org). If you download your GEDCOM file from Ancestry (or have one already), you can open it in Gramps. Set yourself as the home person, then export to a new CSV file (not GEDCOM!!) using the option "Ancestors of Home Person" [you].

That will give you a CSV file with just you and your direct ancestors. Place it in the same folder as the other files.

These are just "utility" files that will come in handy later once you start comparing data.

More in the next part!

Playing with DNA Information Part 2

Now let's look at what is actually in the different files.

Before we import anything into MS Access, it is a good idea to know what each file contains. It will become important later when we start developing queries.

On the left are the files created by the DNAGedcom client for uploading into the DNAGedcom site. Once they've been uploaded and processed, you will have the files on the right available to you (look in the Members/Files area on the DNAGedcom site).


If we look at what each file contains, we can start to get an idea of how we might combine otherwise uncombined information.

In the next part, I'll discuss a file or two that you will want to create that will come in handy.


Playing with DNA Information Part 1

This diagram is a quick layout of how the data moves around from different sites, and what that data looks like. This is not intended to be an exhaustive chart to end all charts. (click to enlarge)


Data created by the different sites and tools can also be utilized locally for our own purposes. "Why?" you may ask. Because even after using many of the tools, many questions still go unanswered. My ultimate goal will be revealed soon. :)


Wednesday, January 4, 2017

Cherokee Quest Part 1 - Clue 1: Mary "Minnie" Howe

Background:
It never really occurred to me that I might have Native American (NA) DNA. It was in looking at my DNA admixture that the possibility first entered the equation. Under the category AmerIndian, on Chromosome 9 there is definitely something (4.9%) worth investigating.

In speaking with relatives, I have found out that indeed, there are some who were supposedly Cherokee or other NA. Everything appears to center on my 2nd Great Grandfather, James Akes.

The first person of interest is his mother, my 3rd GGM Mary "Minnie" Howe. I can find no information so far other than the research by Judy Langen which has conflicting information. (Her research can be found here: http://www.geocities.ws/judylangen/akes/desc0001.htm#id14556).

Generation Four

40. John Jeff4 AKES (William P.3, Peter2 AKERS (AKES), Mr1 AKERS) was born on 20 Feb 1814 Indiana.7 He married Mary (Minnie) HOWE (an indian).7 He married Martha (--?--).7 He died on 26 Jul 1864 in Fort Smith, Scott County, Arkansas, at age 50.12
Mary (Minnie) HOWE (an indian) was born Missouri.7 She died. [***This is not very helpful.]
The nine known children of John Jeff4 AKES and Mary (Minnie) HOWE (an indian) were as follows:
  • 133. i. James H.5 AKES was born in Feb 1839 (prob) Daviess County, Kentucky.7 He married Melissa A. (--?--) [Farr] in 1869.7 He died on 4 Jan 1902 in Mill Creek, Pope County, Arkansas, at age 62.12 On 8 Jun 1833, James AKES, of Perry County, Missouri, purchased land described as:
    the South East quarter of the South East quarter of Section thirty three, in Township thirty six North, of Range nine East, in the District of Lands subject to sale at Jackson, Missouri, containing forty acres.54  [***example of contradiction.. how can he purchase land 6 years before he is born?]
  • 134. ii. Martha AKES was born in 1842 (prob) Daviess County, Kentucky.7
  • 135. iii. John T. AKES was born in 1845 (prob) Daviess County, Kentucky.7 He married Semanthy Elizabeth H. HUSE.7
  • 136. iv. Smith AKES was born in 1848 (prob) Perry County, Missouri.7 On 1 Jan 1859, Smith AKES, of Perry County, Missouri, purchased land described as:
    the South half of the South East quarter and the South half of the South West quarter of Section four, in Township thirty five North, of Range nine East, in the District of Lands subject to sale at Jackson, Missouri, containing one hundred and sixty acres.55
  • 137. v. Smith A. AKES was born in 1852 (prob) Perry County, Missouri.7
  • 138. vi. Mary Tillety "Telitha" AKES was born in 1853 (prob) Perry County, Missouri.7 She married James Anthony CROCKETT.7
  • 139. vii. Elitha AKES was born in 1855 (prob) Perry County, Missouri.7
  • 140. viii. Napoleon B. AKES was born in 1856 (prob) Perry County, Missouri.7
  • 141. ix. Daniel L. AKES was born in 1857 (prob) Perry County, Missouri.7
Martha (--?--) was born in 1825 Missouri.7
There were no known children of John Jeff4 AKES and Martha (--?--).

Mary "Minnie" Howe can be found in a couple of the Censuses, but more research is needed to find out who she was and where she came from.