So in the last post, I was sure I had licked the problem of database design. Well, pride goeth before the fall and it turns out.. I hadn’t. That was before I had really begun to consider the problem of data entry. It was in figuring out how to take a spreadsheet full of words (or what will be a Comma Separated Values file) into the database that I realized.. I was still completely wrong. In this post, I’m going to talk about data entry, and about what I think is the *actual* solution to the database design.
But first it’s become clear that I’ve not yet actually been clear about what I’m trying to do with this project. And so I’d like to talk about that now.
The goal of this project is not to create a list of ships, or a list of officers, of a list of deployments. Frankly others have, and are doing that, and there’s no sense in replicating that. The goal of the project is to allow the power of databases to be brought upon specifically ADM 8 and 107 so that researchers can get to grips with not one report, or even ten reports, but dozens, or hundreds of reports. These documents are not individually difficult to read, but it is difficult to read many of them, and then relatively quickly find patterns. Further, this project is all about the sources- about ADM 8 and ADM 107. It certainly isn’t arguing that they are 100% accurate. I am trying to provide tools so that researchers can find text related patterns as well as other content-related patterns, and then analyze the sources as a whole, as well as the content. Also, the point of this project is not to harvest data from the sources, and then use a database to store the content and present it in a static format. I really want to allow users to harness the power of the database and ask all different types of questions, and look at the data in different ways.
I think that’s probably the best expression of what I’m trying to do yet.
Now, I *think* I have finally sorted out the database structure. Enough to actually implement it, anyways. And the two things that helped me get there were first, the process of actually some data entry and putting information into a spreadsheet. I’ve started with ADM 8, because frankly it’s a lot simpler to actually *get into*, and do some transcribing.
here’s a screenshot of what’s going on so far with the spreadsheet for data entry
I think I’m still going to make some changes. But, this is relatively close to what’s in ADM 8. From left to right, the values as written. Pretty straight forward for transcribing.
Now, before I was going too far in trying to have the database structure be different from the documents. So much so that it didn’t work. Luckily, I work at a bookstore where I use a database, and it was in using that, that I realized what I think should be my approach.
So ships and officers are two major tables, but I realized that they don’t need to have all the information associated in their row. So, I’ve created some other tables 1. ADM_8_Report- which will reflect a single line of the spreadsheet- so one per ship per month. From the image above, some things are off screen to the right. Specifically, Transcription Notes, and References. The former refers to anything that the person doing the data entry needs to note- for example, when ships have an alternate spelling. For example, in this first month (July 1673), the Royal Charles was named Charles Roye, and the St Michael was the Michael, St. So, for the main entry, I want to have the ‘proper name’- so that the script that reads in the data from the spreadsheet can . well, mostly try to avoid duplicate records for the same ship. Where there is more than one alternate spelling, I’m planning to basically concatenate the new note to the end of the previous note.
So the screenshot above has a pretty good representation of how the documents in ADM 8 actually look. let’s have a look at how they’re going to look in the database.
First, let’s look at how the ships will be stored.
As you will see- not all the fields in the Ship table are actually in the ADM 8 spreadsheet- tbh, since I have Rif Winfield’s books, I will use those to supplement the information found in ADM 8, and just provide some more context for those ships.
So the big thing that I just didn’t understand before, which made me mess up my intended structure was that I didn’t understand how tables related to each other. I thought that the ship’s table could have a reference to another whole table that would list, for example, all the captain’s for the ship. And another table would be all the different numbers of guns, another for numbers of crew members. There are two mistakes here. First, a relationship has to be between one specific record, and another, not to just another table. And second, if I did have a table that would be, for example, ‘Antelope_Captains’, and another as ‘Antelope_Crew’, both would need to have the *dates* in there, so there will be a *lot* of duplicated data. What I realized is that all the information that would be associated with that date should all be in the same record. For ships, that will be the information from each month, from ADM 8. So, I created a table called ADM_8_Report. As I mentioned before- this will be a *huge* table- one per ship per month. But if the clunky old computer at work can take only a few seconds to go through literally tens of thousands of sale orders going back a decade or more, then I don’t need to worry about so many individual reports. Below is the structure of that table.
So here we are. For those who are wondering, these screenshots are from phpmyadmin, which is what I use to manage the MySQL database.
So, as you can tell, it’s a bit of a mess. First, the index- I don’t know what it’s going to be, I need to sort out a function for creating the primary key. Each report will be directly linked back to both the ship, and the captain. This structure will allow me to track the recorded numbers of crews and guns for each ship over time without repeating data, but also allows for complexity in the data- for example, are ships part of the fleet (in 1673 there was a fleet- noted in the records as Red or Blue (a or b or A or V or whatever). But also, there’s things like ‘Escorting a Convoy to Constantinople’. What I’ve actually just done in the middle of writing this blog post, add another column, so that in the spreadsheet the location and what they’re doing there are distinct, and can be added to the database distinctly without having to do some parsing when reading the data from the file. But I think this structure will help because this way I avoid one column having different types of information in it that really should be separate.
The other thing that should be mentioned at this point is that I’ll be using PHP to read the file, and then to write to the database. I’ll be using a command that reads a line of the file at a time, so I’m going to keep on futzing with the structure of the excel spreadsheet until I can figure out how to optimize it between what’s efficient for coding, and what’s efficient for data entry.
You may have noticed, I’ve not talked about ADM 107 at all- that’s because I’m still trying to figure out how I want to capture all the text from the source into a database- they are a much more *written* type of document. There is a lot of text in the document- what looks like boiler plate, but when you read more than a handful it becomes clear that there is a lot of variation. So I need to put more thought into how I’m going to extract that information, and store it.
And that will be the subject of the next blog. In the mean time, I’m going to keep on going with this data entry and coding.