We are helping Deb Zajchowski at The Clearity Foundation on their patient database. The Clearity Foundation is a non-profit organization dedicated to "improving treatment options for ovarian cancer patients." To improve treatment they take a personalized medicine approach. They have a privacy-ensured database in which they collect information on the patient’s clinical history, including physician’s diagnosis, diagnostic procedures, treatments and the results from tumor molecular profiling analyses. The database also records the drugs that are likely to have clinical benefit based on the profile of the patient’s cancer.
Our goal is to enhance the Clearity database ability to track the patient outcomes in a retrospective analysis. The current database was developed by Michael L. Petka. It works well for the profiling and reporting needs of the organization. However, Mike is currently occupied with other aspects of database enhancement.
In order to ensure the privacy of the patient, we needed a test database that did not contain any actual patient information. In this blog I will describe how we created a test Patients table in Excel.
The Patients table was created by Michael L. Petka, and is made up of the following columns:
Column Name | Description |
---|---|
PatientID | Patient Identification Key |
PatientLastName | Last Name |
PatientFirstName | First Name |
PatientMiddleInitial | Middle Initial |
PatAddress | Street Address |
PatCity | City |
PatState | State |
PatZip | 5-digit ZIP Code |
SSN | Social Security Number |
DOB | Date of Birth |
Sex | Sex |
PatientTelephoneNumber | Telephone Number |
PatientEmail | Patient Email |
Patient Identification Key
The patient identification key (PatientID) is a sequential numbering of the patient rows.
Patient Name
We were fortunate to find a table of 10,000 random names at The ColdFusion Open Source Software Blog. In addition to the columns we needed, this database has complete name strings, both first name first, and last name first. The first name, last name and middle initial columns were cut and pasted into our test Patients table in Excel.
Patient Sex
The sex of the random names was determined in Excel by comparing the first name of a random name with a list of girls’ names from RandomNames.com. If the name matched, F (female) was assigned, otherwise M (male) was assigned, using the function below.
Sex Determination Based on Name
=IF(EXACT(A5,LOOKUP(A5,I$2:I$895)),"F","M")
where column A contains the random first name (above element A5 is being examined), and array I2 to I895 contains the girls’ first names.
Street Address
We found 261 of the most popular neighborhood street names at Living Places. We generated random house numbers and random street names, then concatenated them together in Excel.
Random House Number
=INT(11000*RAND())
Random Street Name
=INDIRECT("StreetNames!B"&RANDBETWEEN(2,262))
where array B2 to B262 in the StreetNames worksheet contains the most-popular neighborhood street names.
Street Address Concatenation
=CONCATENATE(A5," ", B5)
In the example above we are working on row 5, where column A contains the random house number and column B contains the random street name.
City, State and ZIP code
A table containing city, state and ZIP codes of 80,810 places was downloaded from A Free Zip Code Database as the Excel file free-zipcode-database.xlsx. This file has more than just city, state and ZIP codes. It also includes type (STANDARD, PO BOX ONLY, MILITARY, etc.), county name, latitude, longitude, population, land area and water area. In Excel, the table was sorted on type, for we only wanted to use types STANDARD and PO BOX ONLY. After removing everything else, the list was reduced from 80,180 to 73,756 entries. We then generated 10,000 random ZIP keys to match up with our 10,000 random names in Excel using RANDBETWEEN(2,73757). All of this was stored in the City worksheet in Excel. In the Patients table (worksheet), the city, state and ZIP codes were added using arandom ZIP code key. The city name was capitalized using the PROPER function; see below.
City, State and ZIP Code
=PROPER(INDIRECT("City!D"&City!M5)) =INDIRECT("City!E"&City!M5)) =INDIRECT("City!A"&City!M5))
Where “City!D”, “City!E” and “City!A” are the city, state and ZIP code columns, respectively, in the City worksheet.
Social Security Number
A random social security number (SSN) was generated for each fictitious patient by concatenating three RANDBETWEEN operations together; see below.
Social Security Number
=CONCATENATE(RANDBETWEEN(100,999),"-",RANDBETWEEN(10,99),"-",MID((RANDBETWEEN(10000,19999)),2,4))
Date of Birth
We wanted a normal distribution of birthdays, so we randomly seeded the NORMINV function with random value between 0 and 1. We used a mean of 21,000 (June 29, 1957) and a standard deviation of 5,000 days (13.7 years) in the NORMINV function to get an acceptable distribution; see below.
Normally Distributed Date of Birth
=NORMINV(RAND(),21000,5000)
Telephone Number
Random telephone numbers were generated in the same manner as the social security numbers, by concatenation of three RANDBETWEEN operations; see below.
Telephone Number
=CONCATENATE(INT(RANDBETWEEN(100,999)),"-",INT(RANDBETWEEN(100,999)),"-",MID((RANDBETWEEN(10000,19999)),2,4))
Email Address
Fictitious patient email addresses were constructed from each fictitious patient’s first name, last name and the domain name “@example.com”; see below.
Email Address
=CONCATENATE(C5,".",B5,"@example.com")
where columns B and C contain the last and first names, respectively. In the example above we are working on row 5.
Stability of the Excel Random Cells
We found that the randomly created cells in Excel were not stable. If such a cell were edited, previous random values would be regenerated, effecting all the cell that depended on that random number. Therefore, the final step was to write out the Patients worksheet as a comma-separated values (CSV) file, and then to read the CSV back into Excel. In this way we were able to create a stable test Patients table in Excel.
Pam Todd Thanks for your iingshts. Technology has brought many important innovations to the healthcare landscape, but it hasn?t changed the importance of the doctor/patient relationship.