Tom MacWright

tom@macwright.com

GEDCOM

Update: Additional notes from newer GEDCOM standards

A half-decade ago, I wrote a parser for genealogical data called parse-gedcom. I worked on it a little bit when I was making a handcrafted family tree as a Christmas gift, but besides that, haven’t thought about it that much.

But recently, I’ve been working on the project more and immersing myself in the GEDCOM specification. There’ll be a new release in a little while, but the format deserves its own piece.

GEDCOM is magical in some aspects, and troubling in others. Most of its magic has to do with ambiguity.

The difference between reality and a computer’s encoding of reality is the source of a lot of anger, disagreement, and pain in the world. People have an ambiguous, nuanced, rich understanding of the world. They don’t have strict schemas for their thoughts.

Computers are strict. Some HR systems can’t cope when an employee has one legal name and another they use every day. Social networks let you mark someone as a colleague or a friend, but not just a little of both. Calculators don’t let you give vague estimated numbers and return a range of results.

This gap is where people talk about what data is. A lot of people will say that a spreadsheet is data, but an essay isn’t. A video interview isn’t data, but the systematized observations of that interview are data. Structured things are data, unstructured things aren’t. Data is specific, systematized, machine-readable.

There’s a lot of fogginess here. Run an essay through some ML-based language processing and it starts to look more like “data”. Give your data enough structure and it starts looking more like English and less like a spreadsheet.

Anyway, it’s not that computers can’t represent ambiguity, but that most software chooses not to. Ambiguity is complicated to represent and it seeps into every part of your application. If you allow for fuzzy numbers in one field, all of the math that touches those numbers now needs to be represented as a range of outcomes rather than just one. If you allow freeform input in a survey instead of multiple-choice, then it becomes a lot harder to visualize.

Cue a lot of conflict and anger about technology. Gender inputs that give you two options. Maps that show one country’s conception of the world’s borders. Forms that make assumptions based on cultural biases. Systems that don’t capture the details of reality or the ambiguity of knowledge.

Ambiguity

But GEDCOM supports ambiguity. It supports enough ambiguity to make a sociologist shed tears of joy and a programmer shed the normal kind.

Starting with the most interesting bit: date ambiguity. A date in GEDCOM can look like “2021”, meaning any time in 2021. Or it could be “JAN 2021” for any time that January. It could be “BET 2020 AND 2021” for any time between those two dates. The format supports these things, so the software usually does too. The genealogy software I use, MacFamilyTree (no relation), will show approximate dates in timelines and other visualizations.
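One way to work with dates like these is to turn each one into the earliest and latest day it could mean. The sketch below is an assumption-laden illustration: `parse_gedcom_date` and `date_bounds` are hypothetical helper names, and the code only handles the forms mentioned above (a year, a month and year, a full day, and `BET ... AND ...`), not the other qualifiers the specification defines.

```python
import calendar
from datetime import date
from typing import Tuple

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def date_bounds(text: str) -> Tuple[date, date]:
    """Earliest and latest day a plain date like '2021' or 'JAN 2021' could mean."""
    parts = text.split()
    if len(parts) == 1:  # "2021": any day that year
        year = int(parts[0])
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:  # "JAN 2021": any day that month
        month, year = MONTHS[parts[0]], int(parts[1])
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if len(parts) == 3:  # "5 JAN 2021": an exact day
        exact = date(int(parts[2]), MONTHS[parts[1]], int(parts[0]))
        return exact, exact
    raise ValueError(f"unsupported date: {text!r}")


def parse_gedcom_date(text: str) -> Tuple[date, date]:
    """Handle the plain forms plus 'BET <date> AND <date>'."""
    text = text.strip().upper()
    if text.startswith("BET ") and " AND " in text:
        start, end = text[4:].split(" AND ", 1)
        return date_bounds(start)[0], date_bounds(end)[1]
    return date_bounds(text)


print(parse_gedcom_date("JAN 2021"))
# (datetime.date(2021, 1, 1), datetime.date(2021, 1, 31))
```

Keeping both ends of the range, rather than collapsing "2021" to January 1, is what lets a timeline draw the date as a span instead of a point.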

Zooming out, GEDCOM supports ambiguity of everything. If someone has multiple names they use, you can add multiple names, in order of which ones they prefer. Not sure whether someone was born in Morristown NJ or Morrestown NJ or Morris Township NJ? You can add multiple values of anything, in descending order of reliability. And you can cite those values with sources and notes to help future records-keepers keep track.
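In the file itself, "multiple values" just means repeating a line. The sketch below uses a made-up individual (the names, the `@I1@` identifier, and the `values_at` helper are all hypothetical) and a deliberately tiny line reader, not parse-gedcom, to collect every value at a given tag path in the order it appears.

```python
SAMPLE = """\
0 @I1@ INDI
1 NAME Elizabeth /Ward/
1 NAME Betty /Ward/
1 BIRT
2 PLAC Morristown, NJ
1 BIRT
2 PLAC Morris Township, NJ
"""


def parse_line(line):
    """Split 'level [@xref@] TAG [value]' into (level, tag, value)."""
    level_text, rest = line.strip().split(" ", 1)
    if rest.startswith("@"):  # record lines carry an identifier before the tag
        _xref, rest = rest.split(" ", 1)
    tag, _, value = rest.partition(" ")
    return int(level_text), tag, value


def values_at(gedcom, *path):
    """Collect values whose nesting of tags matches path, in file order."""
    stack = []  # tags of the enclosing lines, one per level
    found = []
    for line in gedcom.strip().splitlines():
        level, tag, value = parse_line(line)
        del stack[level:]  # leave any deeper or sibling structure
        stack.append(tag)
        if tuple(stack) == path:
            found.append(value)
    return found


print(values_at(SAMPLE, "INDI", "NAME"))
# ['Elizabeth /Ward/', 'Betty /Ward/']
print(values_at(SAMPLE, "INDI", "BIRT", "PLAC"))
# ['Morristown, NJ', 'Morris Township, NJ']
```

Because the result is a list, software that wants a single answer takes the first entry, and software that wants to show the uncertainty has the rest.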

There are systems that handle ambiguity in programming, but they’re pretty few and far between. The EDTF standard, maintained by the Library of Congress, can handle date uncertainty. There was an abortive effort to create a W3C standard for uncertainty, but it didn’t go anywhere. If you want to represent a date that’s just “2021” - not January 1, 2021 - in any programming language, you’ll be hard-pressed to find a good representation.

GEDCOM isn’t just a file format that handles factual and chronological uncertainty, it’s a massively-successful, widely-implemented, de-facto standard with those attributes. In one sense, it faces the complexity of the real world, of applied epistemology, and embraces it. In another, it does the opposite.

GEDCOM is a project of the Church of Jesus Christ of Latter-day Saints - informally, the Mormon Church. In particular, FamilySearch, one of the more popular genealogy websites, is a wing of the LDS Church and a successor of the Genealogical Society of Utah.

Religious enthusiasm for ancestry isn’t exclusively Mormon - anyone who has dug into their Irish or Scottish kin has probably tried consulting the very thorough Catholic, Methodist, and Presbyterian registers. However, the intent of this effort is a bit different. From a BYU Studies Journal:

Impelled by what they refer to as the Spirit of Elijah, Church members seek to identify their ancestors and then perform sacred ordinances in their behalf in temples.

Sacred ordinances here refers primarily to baptism for the dead. In the same BYU article, an 1885 article from a church-owned journal says:

The same motive does not prompt the members of the various genealogical societies of New England and other places as urges the Saints to make similar researches… And so the work of forming these societies and collecting and publishing genealogical data goes on in this and other countries; and thousands of men are laboring assiduously to prepare the way, though unconsciously, for the salvation of the dead.

So: the intent of the Church was to know the names of all the dead, to perform proxy baptisms, to offer the dead a chance to be saved in the Mormon faith.

I’m not going to go much further into this one. There’s discussion around baptizing Jewish people, and an effort by the Vatican to stop sharing data of Catholics. It’s really something. But the takeaway as far as it affects GEDCOM is the intent and the control: this is a format essentially written by the church, for the purpose of posthumous baptism.

Does that affect the design choices in the standard? Yes, it does.

Same-sex relationships

From the Handbook, section 38.6.16:

Any other sexual relations, including those between persons of the same sex, are sinful and undermine the divinely created institution of the family.

Accordingly, the GEDCOM specification repeatedly defines marriage as between a man and a woman. It records a marriage as a MARR event on a FAM (family) record, which can have only one HUSB and one WIFE. Examples of “supporting same-sex couples” in the format typically mark one man as a wife or woman as a husband. Some applications don’t respect that part of the specification (my parser is agnostic about this and all other structure), and Tamura Jones has written a very detailed and precise examination of how they can do that.
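For a sense of what that looks like on disk: in GEDCOM 5.5.1 files the spouse links are level-1 HUSB and WIFE lines on the family (FAM) record that also holds the MARR event. The record below is made up, and `spouse_slots` is a hypothetical helper rather than anything from parse-gedcom; it just shows that the format offers two gendered slots and nothing else.

```python
FAMILY = """\
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 MARR
2 DATE 1952
"""


def spouse_slots(record):
    """Return the individuals referenced by the level-1 HUSB and WIFE lines."""
    slots = {"HUSB": [], "WIFE": []}
    for line in record.strip().splitlines():
        parts = line.split(" ", 2)
        # Only level-1 lines with a pointer value are spouse links.
        if len(parts) == 3 and parts[0] == "1" and parts[1] in slots:
            slots[parts[1]].append(parts[2])
    return slots


print(spouse_slots(FAMILY))
# {'HUSB': ['@I1@'], 'WIFE': ['@I2@']}
```

Recording two husbands or two wives means either filing one person under the wrong tag or repeating a tag, which a strict reader may reject.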

FamilySearch started supporting same-sex relationships in late 2019, while reiterating that the Church still considers those relationships sinful.

Sex

Speaking of mismatch with reality, the values for the SEX field in GEDCOM are:

  • M = Male
  • F = Female
  • U = Undetermined from available records and quite sure that it can’t be

Even taking this field name literally and only considering biological sex, the omission of intersex is alarming. And here, again, you could just put a different value, but you’re dramatically limiting compatibility.
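The compatibility cost is easy to picture. A strict reader, sketched here with a hypothetical `read_sex` function and assuming it accepts only the three codes listed above, has no choice but to reject anything else:

```python
SEX_VALUES = {
    "M": "Male",
    "F": "Female",
    "U": "Undetermined",
}


def read_sex(value):
    """Strict reader: accept only the three codes in the list above."""
    code = value.strip()
    if code not in SEX_VALUES:
        raise ValueError(f"unsupported SEX value: {value!r}")
    return SEX_VALUES[code]


print(read_sex("F"))  # Female
# read_sex("X") raises ValueError in this strict sketch
```

A more forgiving reader could pass unknown codes through untouched, but then every program that reads the file has to agree on what they mean.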

ANSEL

GEDCOM made the mistake of hitching its wagon to ANSEL, a rare and limited character encoding. Thankfully, most programs don’t default to ANSEL, but those that do will struggle to handle any non-western scripts - which will botch a wide variety of names that people actually have.
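A file declares its encoding with a `CHAR` line in the header, so a reader can at least find out what it is dealing with. The sketch below is an illustration under assumptions: `declared_charset` and `python_codec` are hypothetical helpers, the mapping to Python codec names is my own reading of the 5.5.1 character set names, and as far as I know Python's standard library has no ANSEL codec, so that case maps to `None`.

```python
from typing import Optional

# Assumed mapping from GEDCOM 5.5.1 character set names to Python codec names.
# ANSEL is deliberately absent: it would need a third-party or hand-written decoder.
PYTHON_CODECS = {
    "UTF-8": "utf-8",
    "ASCII": "ascii",
    "UNICODE": "utf-16",
}


def declared_charset(header: str) -> Optional[str]:
    """Return the value of the header's level-1 CHAR line, if there is one."""
    for line in header.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "1" and parts[1] == "CHAR":
            return parts[2].strip().upper()
    return None


def python_codec(charset: Optional[str]) -> Optional[str]:
    """Codec name Python can use directly, or None if special handling is needed."""
    return PYTHON_CODECS.get(charset) if charset else None


header = "0 HEAD\n1 CHAR ANSEL\n"
print(declared_charset(header))                # ANSEL
print(python_codec(declared_charset(header)))  # None
```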


Besides its cultural failings, GEDCOM is also an abandoned specification: the main current version is 5.5.1. A 6.x version was abandoned and a GEDCOM-X variant was abandoned too. The 5.5.1 specification is full of head-scratching mistakes and ambiguities, like fields that seem to be too short for the data that they contain.

Well-meaning people have spent ages trying to create a better standard. But there are a lot of crosswinds: the modern age of websites & platforms over desktop technology has made format standardization seem outmoded. The inertia of a massive, decently-functioning industry standard is hard to upset. This isn’t just that annoying xkcd cartoon - the process of crafting, popularizing, and maintaining standards is tough and involves a lot of luck.

I don’t know if there’s a future for genealogy that involves files and desktop applications like MacFamilyTree or Gramps. Maybe everyone immediately goes to Geni, FamilySearch, or a DNA company like 23andMe.

For now, building a library that works with GEDCOM data opens up a lot of possibilities for people like me, who have been hand-crafting their family trees. If there’s a future for this way of working - a standards-oriented decentralized system or a traditional desktop system - I hope that it is extricated from the legacy of this format. There’s a lot of discussion still about whether technology can have moral valence or bias. GEDCOM answers that question.