In the previous entry in this series, I looked at the two series (ADM 8 and ADM 107) that I am using for this project. In this entry, I’m going to talk specifically about database design. I’ve had a number of different thoughts and approaches, and there are a few problems. There are different ways that I could do it, depending on the priority. However, I think I have sorted out the database design, thanks to the intervention of a number of people.
So a few words are needed here about what I’m choosing. WordPress, MySQL, and PHP. There are lots of very legitimate criticisms coders and web devs could make about all of these choices, often about security. They are totally legitimate. I am using these tools because they are what I have to hand and frankly it would be beyond my means to select and provide the infrastructure to choose any other technology. And to be honest, this works for me because my previous experience was using PHP, so it’ll a good first step to get me back into coding, database work etc and make it easier for me to later transition into other scripting languages, database things etc.
To be quite frank, until very recently (A couple of days before this was posted)I had a quite poor sense of what the database would be, in my head. For one thing, I thought that ADM 8 would be entirely distinct from ADM 107. Second, I had in my head a database structure that rather directly mimic’d the structure of the two types of documents. Luckily, both of these are wrong.
I was lucky to be introduced to Charlotte Berry of the IHR, who said to me “You just have to get into the right frame of mind, a bit like looking at your sources as a robot”.
Let’s just let that sink in for a bit.
I mentioned above, that one of the biggest problems I had was that I was thinking about database design in a way that the tables etc would basically replicate the structure of the original documents. This is unhelpful.
For a project like this, there are really four major groups of problems that need to be tackled individually.
- Knowing and understand the sources, the kind of information that is in them. Identifying the different categories and types of data
- Figuring out the database design, so that the information identified in the first problem can be most efficiently stored.
- Data Entry, and creating the script that will put the data into the database
- Creating the front end, so that the data can be viewed, hopefully in as many different ways as possible maybe even including in a format that largely reproduces the original format in the archival volumes
So far in this project, I admit, I’ve been having difficulties considering these steps independently. As I mentioned in previous blogs, I started this project thinking first and foremost about the results of the reports that people would be looking at. Less than helpful. A better example of my wrongheadedness however has to do with the structure of the database reflecting the sources too closely. Specifically, it had to do with ADM 8 and the List Books. Let’s have another look at an example from that source just to refresh our memory.
Alright, so what we have here is a list of stations where ships were deployed, a list of ships in ordinary, etc. These lists are not just that- for each ship is also annotated the captain, the number of crew, etc. In the process of writing the previous post in this series, I realized that there was absolutely no reason why ADM 8 and ADM 107 should be separated as it would be likely that the list of ships would have some overlap, and the list of captains as well. I had in my head, the idea that tables would have arrays- for example, while I figured that the primary key (unique identifier) would be a combination of the ship’s original name and it’s launch year, I thought that the table for each ship record could have an array that would contain later names, and the years they were renamed. On top of that, they would have arrays of the stipulated crew complements, artillery establishments, captain’s names and durations and things like that. Likewise, I thought that there would be a ‘Stations’ table, which would have all the stations listed, and then have a series of arrays, once for each of the monthly reports.
The reason that I came to these design ideas is that a) it’s been too long since I worked with databases and I had forgotten how they worked b) I was not considering the four problems mentioned above distinctly. When looking at database design, the final representation of the data on the screen shouldn’t be a factor. That can be addressed later, when good code and good queries can try to reproduce the original sources.
One of the things Charlotte reminded me is that an array in a database, is really a table. That, specifically, shattered one of the blinders I had created for myself. To be honest, one of the problems I’ve had with this project is that I’ve been shying away from asking good questions because of the fear of the size of the data. I’m looking at four volumes, which is substantial amounts of information. I was worried about the proliferation of tables, and that it would be very difficult to do the data entry, and take up huge amounts of space. In terms of space- it’s text data, it’s really not that bad. But in terms of the proliferation of tables- I have to use a structure that is good for the computer, that will make the queries and functions operate as effectively as possible. Which means, I need to use the tools that MYSQL provides me. This means, take advantage of tables and that I can link things and have aliases. There is no need to replicate data and have arrays of arrays that will be efficient to work through.
So, let’s flip this around and look at actual application.
ADM 8 – in the documents they are monthly reports, with a number of different stations, each with a list of ships (and other data). ADM 107- Exam reports- they have a name (the officer examined), they have a text (which may or may not be standard), a list of ships on which the person served (with position and length of service), and they have some other data- names of captains who signed certificates etc.
For this database, it’s not just going to be only the data in the documents- for example, I’m going to use Rif Winfield’s encyclopedic British Warships in the Age of Sail series to fill in bits of information about the ships when I need to, if I need to.
Clearly, there’ll be a table for ships. The primary key will be the ship’s name and its year of launch. For example, the Royal Charles at the Restoration could be identified as ‘Naseby_1655’, while the Royal Charles built to replace it would be the ‘Royal Charles_1673’ for example. MySQL doesn’t support arrays, so instead of having a field which would be a long, comma-separated array which would have to be pulled in its entirety, parsed, and then the relevant data selected, there will be a link to another table which would contain information about that ship’s names, and then period in which those names applied. likewise, instead of the record of each ship having an array of different gun establishments, there would be a link to another table with that information. And again same with crew complements.
As you can guess, in the documents there will be a substantial number of different ships names- and that many different tables. However, designing the database to use links between tables and aliases is just the proper way to do it. It may take more space than string arrays would, but it’ll be many times more efficient in terms of data retrieval, which is the point.
There are some questions about duplication of data, which do still slightly trip me up- but that’s because I think I haven’t quite got my head fully around database structure, and it’s coming back to reproducing the original documents- because I do feel it would be good, and important, to be able to pull up effectively the original document, whether it be produced in text, in the case of the lieutenant’s exam reports, or in text or visually on a map, in the case of the deployment lists.
In terms of the exam reports, I can foresee there being a table of the reports, which would be a massively long table (b/c of the number of them- but again- don’t worry about this, I mean we don’t have to search through it manually, right?) but for each report it would consist of a bunch of links to other tables – a table for the ‘standard’ text, a link to the officers table for both the person examined and the captain(s) who signed the certificates, and to the ships table for each of the ships that was served on. But this idea comes from the idea that I want it to be easy to reproduce the report. Likewise, for the ships list, the table for the entries would have a a bunch of links to a table of stations, which would link to more links to pull up all the ships at each station for each month.
I’m really starting to think that this is the wrong way to go about it, and that I’m trying to shift ‘responsibility’ from the code in the front end to the database, and I think I’m doing this because I want to make it easier for data entry- but it’s not like whoever is doing the data entry (probably me) will be sitting at computer, manually inputting records. I’m going to put everything into a series of CSV files, which.. I can manipulate. Anyways, the next entry in this series will be about data entry, so I will discuss that revelation and how I plan to approach it then.
If I shift responsibility for recreating the original reports from the database to the front end, it suggests that I don’t need tables that will effectively recreate the reports. ADM 8 is a good example of this, for example for each ship, the record will have a link to a table, where each line is a month with a report and where that ship was deployed/whatever at the time. It’s a pretty straightforward query to search through the table of ships, and then pull out all the ships names that correspond to any month, organize them by station and present them.
I admit, I’m struggling more figuring out how to put the data from the exam reports into the database. I think that the ‘officer’ is the most important entity. There would be a table with the different versions of the text from the reports, and the information from the report would be associated with each officer. After all, even in the paper document- it’s the person that matters- it’s not like the reports are in an order that matters. And if the officers are listed in the order that their report is found in the archives, then the document could be fairly straightly reproduced. The one big wrinkle here is that it basically assumes that each officer only appears once- though that’s a thoroughly unlikely situation and that an individual sitting multiple exams (having failed all but the last) is a situation that probably exists in the documents. The way to deal with this would be, if an individual has more than one exam result associated to them, that there would be a table, with a series of links instead of a single set of links to other tables. Of course, volume and folio numbers need to be part of each record, so that proper referencing can happen.
Of course, I’m sure there’s some problems with this approach and all feedback/criticisms/comments/suggestions are appreciated.
Also, I’d like to thank Charlotte for sending me this link to a free course from the IHR titled ‘Designing Databases for Historical Research’. It’s fantastic.
In the next part of this series, I’ll be talking about the problem of data entry.