In the previous post, I talked about transcribing documents from ADM 8. This post, about the process of uploading the documents to the database, was separated out due to length. Fair warning- much of what I talk about in this post will be fairly techy and about coding, rather than about history. However, I think the discussion of how the upload scripts work, and how I’ve dramatically changed them- twice- over the past several months, will be useful because it shows that web dev and digital humanities work slightly differently than normal academic writing. With normal academic writing, when you’re done with a PhD/article/etc. you publish it, and it’s done. It’s etched in stone, typos and all (except online, where things can be corrected with a few keystrokes). With this project, and with academic web dev in general, there are almost endless possibilities for continually upgrading research tools and projects.
To Start
First, it’s necessary to talk about my programming environment and setup. The vast majority of the coding (and transcription) is done on my laptop; I then copy and paste the text of those files into an SSH session so that they are transferred to my development server (which is my old laptop, upstairs). All of this is a Linux environment, and much of the work (including fixing errors in the code and transcription errors) is done via the command-line interface. I didn’t install a GUI onto my server, and I’m paying for that now (actually it’s fine, except that copying code from a CLI doesn’t work very well). My server runs Apache and phpMyAdmin, so I use the browser for working with the database itself. I use PHP for the scripts, and MySQLi for the interactions with the MySQL database (which is what WordPress uses).
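For readers unfamiliar with that stack, here’s a minimal sketch of what opening a MySQLi connection looks like- the hostname, credentials, and database name are placeholders rather than my actual settings:

```php
<?php
// A minimal sketch of opening a MySQLi connection; the hostname,
// credentials, and database name are placeholders, not my real settings.
$db = new mysqli('localhost', 'adm8_user', 'secret', 'adm8_db');
if ($db->connect_error) {
    die('Connection failed: ' . $db->connect_error);
}
echo 'Connected: ' . $db->host_info . "\n";
```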
A Basic Upload Script
The basic formula for the upload script is as follows (a rough sketch of the skeleton appears after the list). This is by no means an example of what others should do; it’s just the formula that I follow. Please refer back to the previous post for examples of the .csv files.
1. Open a connection to the transcription file
2. Open a connection to the database (this could come later in the file, but I don’t think it hurts efficiency at all)
3. Read in the “Header Lines”: Report Title, Report Date, Column Labels, and Report Notes
4. Check whether the Report already exists in the DB- if yes, exit; if no, continue
5. Iterate through every line in the document, parsing each one and writing to the appropriate tables in the DB
6. Write the ‘Report Data’ to the DB
7. Exit
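Here’s a rough sketch of what that formula looks like in PHP. The table and column names (adm8_reports, title, and so on) are placeholders standing in for the real schema, and parse_row() stands in for the per-report-type parsing logic:

```php
<?php
// A rough sketch of the seven-step formula above. The table and column
// names (adm8_reports, title, etc.) are placeholders, not my real schema.

// Stand-in for the per-report-type parsing and insert logic of step 5.
function parse_row(mysqli $db, array $row): void {
    // each report type has its own version of this
}

$file = fopen('deployment_list.csv', 'r');                          // 1. open the transcription file
$db   = new mysqli('localhost', 'adm8_user', 'secret', 'adm8_db'); // 2. open the DB connection

// 3. read the header lines
$title   = fgetcsv($file)[0];
$date    = fgetcsv($file)[0];
$columns = fgetcsv($file);
$notes   = fgetcsv($file)[0];

// 4. exit if this report has already been uploaded
$stmt = $db->prepare('SELECT id FROM adm8_reports WHERE title = ? AND report_date = ?');
$stmt->bind_param('ss', $title, $date);
$stmt->execute();
if ($stmt->get_result()->num_rows > 0) {
    exit("Report already in the database.\n");
}

// 5. parse every data row and distribute it to the right tables
while (($row = fgetcsv($file)) !== false) {
    parse_row($db, $row);
}

// 6. write the report's own record
$stmt = $db->prepare('INSERT INTO adm8_reports (title, report_date, notes) VALUES (?, ?, ?)');
$stmt->bind_param('sss', $title, $date, $notes);
$stmt->execute();

// 7. clean up and exit
fclose($file);
$db->close();
```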
Each report type requires its own upload script. So far, I have five functional upload scripts- for Ship Deployment Lists, Officer Lists, Service Time/Wage Reports, Status Reports and Fleet Lists. The next one to be written will be for the Additions/Removal lists, which will cover all the major report types. After that, it will be individual scripts for the unique reports that don’t fit these categories (there are fewer than a handful). There have been three distinct phases of the upload scripts, and the changes reflect my process of relearning how to code and how to organize code effectively, as well as lessons I have learned about the ADM 8 documents, about uploading data, and about data errors.
Phase the First
At first, I did something silly. I started by looking at the first couple of Ship Deployment Reports, and based my database design and upload scripts on that alone. (This coincides with the end of the previous series of blogs on this project, and the beginning of the second series, of which this is part.)
In this first phase, each script was distinct, and so much of the code was repeated in each file. There were some neat bits in here- for example, from the beginning I included code so that as each line was parsed, it would check the database to see whether the ships/officers/locations mentioned already existed. An early modification was to account for names repeated more than once- for ships and officers, the script would pull all the appropriate records from those tables, and then iterate through them one at a time, asking if each was the correct one. To make this work better, I included extra fields in the Ships and Officers tables- like launch name, launch date, or a note. For example, this allowed me to differentiate between John Kempthorne (son) and John Kempthorne (father). There were some real problems with this phase of the scripts. For one thing, it involved a lot of reading and writing to the database, as each row was completely read, parsed, and all its information distributed to the correct tables in the database before moving on to the next row.
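A sketch of that one-at-a-time checking, with a hypothetical ships table (id, ship_name, launch_date, note)- the real schema and prompts differ, but the shape is the same:

```php
<?php
// A sketch of the phase-one, one-at-a-time disambiguation, assuming a
// hypothetical `ships` table with id, ship_name, launch_date and note
// columns (the real schema may differ).
function find_or_create_ship(mysqli $db, string $name): int {
    $stmt = $db->prepare('SELECT id, launch_date, note FROM ships WHERE ship_name = ?');
    $stmt->bind_param('s', $name);
    $stmt->execute();
    $result = $stmt->get_result();

    // Iterate through same-name candidates, asking about each in turn.
    while ($ship = $result->fetch_assoc()) {
        echo "Is this the right {$name} (launched {$ship['launch_date']}, {$ship['note']})? [y/n] ";
        if (trim(fgets(STDIN)) === 'y') {
            return $ship['id'];
        }
    }

    // No candidate confirmed: insert a new ship and return its key.
    $stmt = $db->prepare('INSERT INTO ships (ship_name) VALUES (?)');
    $stmt->bind_param('s', $name);
    $stmt->execute();
    return $db->insert_id;
}
```

Because every row triggered at least one query like this (plus the inserts), the database traffic added up quickly.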
Phase the Second
When I finally read through all the documents in ADM 8 volumes 1 and 2, I realized that the database I had created and the scripts I was using were not sufficient. I didn’t start from scratch, but I substantially rejigged things. Previously, I had created a shared file which allowed me to more easily switch between settings for connecting to the database without changing every single file. I now expanded upon that by moving all the shared code into that file. For example, since every upload script will always check whether the ship in a record exists in the DB, the function that does so went into the shared file; similarly for officers and locations. This was more efficient, and allowed me to share improvements across all the upload scripts by making adjustments in a single file. There were still some problems, however. For example, a single function both checked whether a ship existed in the database and, if it didn’t, wrote a new entry to the Ships table; the similar function for officers also did more than one task. So I still had some duplication of code (because more than one function could write a ship/officer/location record to the database). Further, although I had added code which specifically reported how many ships or officers of the same name existed in the database, it still iterated through them one at a time- which led to a significant number of data errors (for example, between two sloops both named Chatham and both in service at the same time).
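A minimal sketch of what that shared file looked like, with illustrative names and settings- note how the ship function does two jobs, which is exactly the problem described above:

```php
<?php
// shared.php- a minimal sketch of the phase-two shared file. Names and
// settings here are illustrative, not the actual ones.

// One place to switch between development and live connection settings,
// so the individual upload scripts never need to change.
function db_connect(): mysqli {
    return new mysqli('localhost', 'adm8_user', 'secret', 'adm8_db');
}

// Every upload script checks whether a ship exists, so the check lives
// here. The phase-two flaw in miniature: this one function also INSERTS
// the ship when no match is found, and the officer and location versions
// did the same, so the insert logic was still duplicated.
function check_ship(mysqli $db, string $name): int {
    $stmt = $db->prepare('SELECT id FROM ships WHERE ship_name = ?');
    $stmt->bind_param('s', $name);
    $stmt->execute();
    $row = $stmt->get_result()->fetch_assoc();
    if ($row !== null) {
        return $row['id'];
    }
    $stmt = $db->prepare('INSERT INTO ships (ship_name) VALUES (?)');
    $stmt->bind_param('s', $name);
    $stmt->execute();
    return $db->insert_id;
}
```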
Phase the Third
In the most recent phase, I’ve made a couple of big usability/programming improvements. First, I redistributed the tasks in the shared file so that each function only does one task- e.g. check whether a ship is in the DB, or write an ‘Officer Service Record’ to the database. This effectively eliminated the duplicated code. Second, I rewrote the functions that check ships and officers against the database. The new function prints the entire list of existing officers/ships with identifying detail such as rate, dates and notes, and then I choose from the list. I also used a Do…While loop so that if I put in an incorrect answer (e.g. outside the correct range of numbers) it loops and I can try again. It’s still possible to get errors, as I can choose the wrong number from the list, but the script prints out the selected option and record key so I can later go back into the DB through the browser and correct errors directly.

The other major change was in how the records were written to the database. Where previously each line had been read, parsed, and the relevant records inserted into the DB before continuing with the next record, the new scripts create arrays of records. Well, mostly- new ships, officers and locations are still immediately written to the database (because I need the keys for those records for other things, and I would rather do that than check both the DB and an array of “new ships/new officers/new locations” when testing each row). When the script is done parsing a file, it echoes the contents of the arrays to the screen, so that I can check everything before consenting to it being written to the DB. I’ve found this to be a significant improvement, because it allows me to make sure the parsing is working correctly, and to catch any errors in the CSV file that I did not catch before. This is particularly useful when developing new scripts and making sure that the fields and data are represented correctly.

Unfortunately, I have not created a function that allows me to pause, go into the arrays, and correct issues- if I spot something wrong I have to abort, fix the issue, and run the script again. However, this is much better than the previous routine of doing the uploads and not catching errors until I browsed the records in the database, at which point I had to purge them and run the script again. Luckily, there are very few overly long reports, so re-running the upload scripts is rarely onerous. Once the upload is complete, I export the database using phpMyAdmin, and then use the same software to upload it to the live server.
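Here’s a hedged sketch of those two changes- all the function names, fields, and prompts are hypothetical:

```php
<?php
// Sketches of the two phase-three changes; all names here are hypothetical.

// Change 1: print every same-name candidate at once, with identifying
// detail, then use a do...while loop to re-prompt until the answer is
// within range.
function choose_ship(array $candidates): int {
    foreach ($candidates as $i => $ship) {
        echo "[$i] {$ship['name']} ({$ship['rate']}, launched {$ship['launched']}, {$ship['note']})\n";
    }
    $new = count($candidates);
    echo "[$new] None of these - create a new record\n";
    do {
        echo 'Choice: ';
        $choice = (int) trim(fgets(STDIN));
    } while ($choice < 0 || $choice > $new);
    return $choice; // the caller echoes the chosen option and record key
}

// Change 2: accumulate parsed records in arrays, echo them for review,
// and only write after confirmation. (New ships/officers/locations are
// still inserted immediately, since their keys are needed mid-parse.)
$serviceRecords = [];
// ... the parsing loop fills $serviceRecords ...
print_r($serviceRecords);
echo 'Write these records to the DB? [y/n] ';
if (trim(fgets(STDIN)) === 'y') {
    // insert_service_records($db, $serviceRecords); // hypothetical batch insert
    echo "Records written.\n";
} else {
    echo "Aborted- fix the CSV and re-run the script.\n";
}
```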
You may have noticed I haven’t posted any pictures/examples as I did in previous posts- if you’re interested in the code, please email me and I can send over the files, though they probably wouldn’t be much use.
In the next post in this series, I will be talking about the development of the user interface, and the difference between browsing and querying the database.