Escape from the Dark

LITA Forum

November 9, 2018

David Naughton - UMN Libraries Web Development

naughton@umn.edu

z.umn.edu/lita2018

In the Dark: Excellent investigative journalism podcast by Madeleine Baran. Season One is about what went wrong in the Jacob Wetterling case.

Perhaps I could best describe my experience of doing mathematics in terms of entering a dark mansion. One goes into the first room, and it’s dark, completely dark. One stumbles around bumping into the furniture, and gradually, you learn where each piece of furniture is, and finally, after six months or so, you find the light switch.

Andrew Wiles

Bounded Rationality

Extended quotation on the "Impossibility of Unfamiliar Optimization When Decision Time Is Scarce" follows.

Impossibility of Unfamiliar Optimization ...

In the case of an unfamiliar problem, the decision maker must devise a method for finding the alternative to be chosen before it can be applied. This leads to two levels of decision-making activities which both take time:

Level 1: Finding the alternative to be chosen.
Level 2: Finding a method for Level 1.

What is the optimal approach to the problem of Level 2? One can hardly imagine that this problem is familiar. Presumably a decision maker who does not immediately know what to do on Level 1 will also not be familiar with the task of Level 2. Therefore, some time must be spent to find an optimal method for solving the task of Level 2. Thus we arrive at Level 3.

It is clear that in this way we obtain an infinite sequence of levels k = 2, 3, ... provided that finding an optimal method for Level k continues to be unfamiliar for every k.

Level k: Finding a method for Level k - 1.

Selten, Reinhard. "What Is Bounded Rationality?" Bounded Rationality: The Adaptive Toolbox. Ed. G. Gigerenzer and R. Selten. Cambridge, MA: MIT Press, 2001. 13-36.

How to escape such darkness?

mea culpa

  • How can we escape faster?
  • How can we better predict required time?
  • What mistakes did I make?
  • What was difficult for me?
  • What do you think?

Experts@Minnesota Data Flow

Mistake 1

Underestimating Unknown Unknowns

Solutions

Things You May Not Know That You May Not Know

  • Pure
  • Your Org's HR Data & Systems
  • Pre-existing integration systems, e.g., Master List?
  • Other systems to be integrated?
  • New, dependent systems to be created?

  • ETL
  • Software Development
  • Software Project Management

  • Your Users
  • Your Co-Workers
  • Your Organization

unfamiliarity -> uncertainty

The more items you are unfamiliar with, and the more you are unfamiliar with each one, the greater your uncertainty.

More detail and illustrative examples about some of these items later.

Mistake 2

Believing Agile Hype

Most startup projects are not integrations.

Promising a minimum viable product (MVP) in two weeks may work well for a greenfield project, but is probably unrealistic for most integration projects.

Solutions

Allow time for discovery and understanding of systems to be integrated.

Short Feedback Cycles

Short feedback cycles may be the most valuable benefit of agile. They are well explained in the talk "Real Software Engineering", which also beautifully explains agile vs. waterfall and how software engineering differs from other kinds of engineering.

If Pure allowed updating individual records via an API, rather than requiring entire datasets to be uploaded in Sync, feedback cycles would be dramatically shorter!

Mistake 3

Tightly Coupled Designs

Seductive idea: building a single system to do many things will save work, "kill two birds with one stone". In reality, such a system tends to have tight dependencies among its parts, which requires that multiple problems be solved at once. Long feedback cycles!

Solution

Break systems down into smaller systems, with as few dependencies as possible, even if it seems like more work.
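
As a rough illustration of this solution, the sketch below splits a hypothetical data flow into three small stages that share only plain data (lists of dictionaries). All function names and fields are made up for the example and do not come from the actual Experts@Minnesota code. The point is only that each stage can be run and checked on its own, which keeps feedback cycles short.

```python
# Hypothetical sketch: three independent stages that share only plain data.
# None of these names come from the real Experts@Minnesota code.

def extract(rows):
    """Stand-in for reading rows from a source system; here it just copies them."""
    return [dict(row) for row in rows]

def transform(records):
    """Normalize names and drop records without a job title."""
    return [
        {"name": r["name"].strip().title(), "job_title": r["job_title"].strip()}
        for r in records
        if r.get("job_title", "").strip()
    ]

def load(records, target):
    """Stand-in for writing to a target system; appends to a list, returns the count."""
    target.extend(records)
    return len(records)

# Each stage can be tested in isolation, without the others being finished.
sample = [
    {"name": " ada lovelace ", "job_title": "Professor"},
    {"name": "alan turing", "job_title": ""},
]
cleaned = transform(extract(sample))
target = []
loaded = load(cleaned, target)
```

Because the stages meet only at simple data structures, a problem in one stage (say, a surprise in the source data) can be investigated without rerunning the whole pipeline.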

Mistake 4

Extreme Rationalism

Rationalism vs. Empiricism

Rationalists claim that there are significant ways in which our concepts and knowledge are gained independently of sense experience. Empiricists claim that sense experience is the ultimate source of all our concepts and knowledge.

Stanford Encyclopedia of Philosophy

Solutions

Less Leibniz...

...more Hume...

...and Leonardo!

Leonardo: The Man Who Saved Science

No plan survives contact with the enemy.

Helmuth von Moltke the Elder

No design survives contact with reality.

Pure API Interactive Documentation

{your.pure.domain}/ws/api/{pure-version}/api-docs/index.html#/

Example: experts.umn.edu/ws/api/512/api-docs/index.html#/
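
The URL pattern above can be filled in with a small helper. This is only a sketch: the function name is hypothetical, and the `https://` scheme is an assumption, since the pattern above shows no scheme.

```python
def pure_api_docs_url(domain, version):
    """Build the interactive API docs URL from a Pure domain and API version."""
    # Assumption: the site is served over HTTPS; the pattern shows no scheme.
    return f"https://{domain}/ws/api/{version}/api-docs/index.html#/"

url = pure_api_docs_url("experts.umn.edu", 512)
```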

Some higher-level overview documentation would make this even better. I also could not find a schema for the changes endpoint.

UMN Pure API Client

github.com/UMNLibraries/pureapi

Better documentation and an open source license coming soon.

Surprises in UMN HR Data

OIT Data Warehouse containing data from PeopleSoft. Completely unfamiliar.

  • No unique identifiers for people's jobs! (staff organisation relations)
  • No reliable way to distinguish a person's multiple jobs from each other.
    • There is a position_nbr in the main "jobs" table, but no one uses it consistently.
    • Some people define a job as a position in a department, which could include a series of different roles or duties (jobcodes).
    • Others define roles with different jobcodes as different jobs.
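
To make the two competing definitions concrete, here is a minimal sketch that counts one person's jobs under each. The rows are invented, and `emplid` and `deptid` are assumed column names used only for illustration; `jobcode` is the column mentioned above.

```python
# Hypothetical rows for one person. "emplid" and "deptid" are assumed column
# names for this example; "jobcode" is the column mentioned above.
rows = [
    {"emplid": "1001", "deptid": "LIB", "jobcode": "9701"},
    {"emplid": "1001", "deptid": "LIB", "jobcode": "9702"},
    {"emplid": "1001", "deptid": "CHEM", "jobcode": "9401"},
]

def count_jobs(rows, key_columns):
    """Count distinct jobs, where a job is identified by the given columns."""
    return len({tuple(row[col] for col in key_columns) for row in rows})

# Definition A: a job is a position in a department, whatever the jobcode.
jobs_by_department = count_jobs(rows, ["emplid", "deptid"])

# Definition B: each distinct jobcode is a separate job.
jobs_by_jobcode = count_jobs(rows, ["emplid", "deptid", "jobcode"])
```

With these invented rows, the same person has two jobs under the first definition and three under the second, which is the kind of disagreement that makes a stable job identifier hard to derive.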

  • No reliable way to determine when jobs start and end.
    • No common definition of "job" (see above).
    • Several columns that may help, if not null:
      • job_entry_dt
      • position_entry_dt
      • last_date_worked
      • effdt combined with effseq, empl_status, status_flg, and job_terminated
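
One way to work with columns that are often null is to take the first populated value from a list of candidates. The sketch below assumes rows arrive as dictionaries, and the priority order is purely illustrative: nothing here says which column should win when several are populated. It also leaves out the `effdt` combination, which needs more logic than a simple fallback.

```python
from datetime import date

# Assumption: this priority order is only for illustration; the data itself
# does not say which column is most trustworthy.
START_DATE_COLUMNS = ["job_entry_dt", "position_entry_dt"]

def first_non_null(row, columns):
    """Return the first non-null value among the given columns, or None."""
    for column in columns:
        value = row.get(column)
        if value is not None:
            return value
    return None

# Invented example row.
row = {
    "job_entry_dt": None,
    "position_entry_dt": date(2015, 8, 31),
    "last_date_worked": None,
}
start = first_non_null(row, START_DATE_COLUMNS)
end = first_non_null(row, ["last_date_worked"])
```

A `None` result is kept explicit rather than replaced with a guess, so that missing start or end dates stay visible downstream.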

Affiliate jobs (e.g., adjunct faculty) are in a "person of interest" table, with very different columns from the main "jobs" table and even worse data entry.

Mistake 5

Failure to consider systems to be replaced as systems to be integrated.

Example: Moving from Master List Spreadsheets to XML Sync

In cases like ours, where we have no unique IDs for people's jobs, Pure automatically assigns its own unique IDs to jobs:

autoid:{organisation-id}{job-title}{employment-type}{start-date}

Afraid I had deleted existing data!

I did not consider carefully enough the effects of a new upload on existing data. One reason: I didn't know about these Pure-generated job IDs. (I also don't think these Pure ID schemes are documented.) Even worse, I mistakenly synced to production!
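
To see why generated IDs of this shape are fragile, the sketch below composes an ID from the four parts in the autoid pattern. It is an assumption that the parts are simply concatenated as-is; Pure's real formatting of each part (dates, spacing, case) may differ, and the function name and sample values are hypothetical.

```python
def pure_auto_id(organisation_id, job_title, employment_type, start_date):
    """Compose an ID following the autoid pattern shown above."""
    # Assumption: the parts are concatenated as-is; Pure's real formatting
    # of each part may differ.
    return f"autoid:{organisation_id}{job_title}{employment_type}{start_date}"

before = pure_auto_id("LIB", "Librarian", "faculty", "2015-08-31")
after = pure_auto_id("LIB", "Associate Librarian", "faculty", "2015-08-31")

# A changed job title yields a different ID, so the same job would no longer
# match its previously generated ID.
ids_match = before == after
```

If any component differs between the old data and a new upload, the generated IDs differ too, which is one plausible way a new upload could fail to line up with existing records.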

Mistake 6

Poor Communication

Technology is easy. People are difficult.

Communication may seem easy, especially for the outgoing. What's hard is making ourselves vulnerable, handling conflict, and admitting failure or lack of knowledge.

The more complex and unfamiliar the project, the more likely the need for difficult communication.

Solutions

  • courage
  • empathy
  • humility

Community Support

Challenge: TMI

Too Much Information

In all the voluminous Pure documentation, how do we find what we need?

Solutions

  • Subscribe to updates to the Confluence Pure Client Space.
  • XSDs for organisations, persons, research outputs, etc. are in the Pure Portal under "Bulk Import". The data returned by the Pure API uses almost identical schemas.
  • PDF document on populating data in Pure.

Questions?

Thank you!