Thursday, October 4, 2018

A Brief Analysis of Text and ASCII Code

To verify that Lexos's hierarchical clustering returns an identical dendrogram whether the input is plain text or ASCII code, I first cleaned Nathaniel Hawthorne's The Blithedale Romance, leaving the capital letters as is. I then converted each letter into its three-digit ASCII equivalent and wrote the lines to a new file, which left me with two files.
The text file looks like this:
I OLD MOODIE The evening before my departure for Blithedale I was returning to my bachelor apartments after attending the wonderful exhibition of the Veiled Lady when an elderly man of rather shabby appearance met me in an obscure part of the street of my story The reader therefore since I have disclosed so much is entitled to this one word more As I write it he will charitably suppose me to blush and turn away my face I I myself was in love with Priscilla
And the ASCII "out-1.txt" file looks like this:
073 079076068 077079079068073069 084104101 101118101110105110103 098101102111114101 109121 100101112097114116117114101 102111114 066108105116104101100097108101 073 119097115 114101116117114110105110103 116111 109121 098097099104101108111114 097112097114116109101110116115 097102116101114 097116116101110100105110103 116104101 119111110100101114102117108 101120104105098105116105111110 111102 116104101 086101105108101100 076097100121 119104101110 097110 101108100101114108121 109097110 111102 114097116104101114 097 115104097098098121 097112112101097114097110099101 109101116 109101 105110 097110 111098115099117114101 112097114116 111102 116104101 115116114101101116 111102 109121 115116111114121 084104101 114101097100101114 116104101114101102111114101 115105110099101 073 104097118101 100105115099108111115101100 115111 109117099104 105115 101110116105116108101100 116111 116104105115 111110101 119111114100 109111114101 065115 073 119114105116101 105116 104101 119105108108 099104097114105116097098108121 115117112112111115101 109101 116111 098108117115104 097110100 116117114110 097119097121 109121 102097099101 073 073 109121115101108102 119097115 105110 108111118101 119105116104 080114105115099105108108097
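As a quick check of the encoding (a minimal sketch; the full script is below), each letter becomes the zero-padded decimal value of its ASCII code point, and the spaces between words are left alone:

print("%03d" % ord("I"))                        # 073
print("".join("%03d" % ord(c) for c in "The"))  # 084104101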
Both files are much longer than the samples above. The resulting dendrogram from the default settings of Lexos's hierarchical clustering looks like the image below. For all practical purposes, Lexos interprets each file exactly the same, and yet to a human reader the two files are completely different. The slightly obtuse Python code that I used to create the ASCII file and the text file is below. I can make it much prettier, but for now, it is what it is.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Apr  9 08:49:53 2017

@author: ray

"""
# An alternative to the loop below: strip bracketed text with regular expressions.
# As written, the substitutions are commented out and the function returns its
# input unchanged; uncomment them to use the regex path instead.
import re

def remove_bracketed_text_by_regex(text):
#    text = re.sub(r"\(.+?\)", " ", text)      # Remove text between parentheses
#    text = re.sub(r"\[.+?\]", " ", text)      # Remove text between square brackets
#    text = re.sub(r"\s+", " ", text).strip()  # Collapse extra white space (optional)
    return text

# A helper that removes (possibly nested) brackets and parentheses from a line
def remove_text_inside_brackets(text, brackets="()[]"):
   count = [0] * (len(brackets) // 2) # count open/close brackets
   saved_chars = []
   for character in text:
       for i, b in enumerate(brackets):
           if character == b: # found bracket
               kind, is_close = divmod(i, 2)
               count[kind] += (-1)**is_close # `+1`: open, `-1`: close
               if count[kind] < 0: # unbalanced bracket
                   count[kind] = 0
               break
       else: # character is not a bracket
           if not any(count): # outside brackets
               saved_chars.append(character)
   return ''.join(saved_chars)
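# Example (illustrative, not part of the original run):
#     remove_text_inside_brackets("keep [drop (this)] this")  ->  'keep  this'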

# The cleanString function below substitutes one string for another (or for
# nothing if the second argument is left empty). Modify as needed.
def cleanString(incomingString):
   newstring = incomingString
   newstring = newstring.replace("a","097")
   newstring = newstring.replace("A","065")
   newstring = newstring.replace("b","098")
   newstring = newstring.replace("B","066")
   newstring = newstring.replace("c","099")
   newstring = newstring.replace("C","067")
   newstring = newstring.replace("d","100")
   newstring = newstring.replace("D","068")
   newstring = newstring.replace("e","101")
   newstring = newstring.replace("E","069")
   newstring = newstring.replace("f","102")
   newstring = newstring.replace("F","070")
   newstring = newstring.replace("g","103")
   newstring = newstring.replace("G","071")
   newstring = newstring.replace("h","104")
   newstring = newstring.replace("H","072")
   newstring = newstring.replace("i","105")
   newstring = newstring.replace("I","073")
   newstring = newstring.replace("j","106")
   newstring = newstring.replace("J","074")
   newstring = newstring.replace("k","107")
   newstring = newstring.replace("K","075")
   newstring = newstring.replace("l","108")
   newstring = newstring.replace("L","076")
   newstring = newstring.replace("m","109")
   newstring = newstring.replace("M","077")
   newstring = newstring.replace("n","110")
   newstring = newstring.replace("N","078")
   newstring = newstring.replace("o","111")
   newstring = newstring.replace("O","079")
   newstring = newstring.replace("p","112")
   newstring = newstring.replace("P","080")
   newstring = newstring.replace("q","113")
   newstring = newstring.replace("Q","081")
   newstring = newstring.replace("r","114")
   newstring = newstring.replace("R","082")
   newstring = newstring.replace("s","115")
   newstring = newstring.replace("S","083")
   newstring = newstring.replace("t","116")
   newstring = newstring.replace("T","084")
   newstring = newstring.replace("u","117")
   newstring = newstring.replace("U","085")
   newstring = newstring.replace("v","118")
   newstring = newstring.replace("V","086")
   newstring = newstring.replace("w","119")
   newstring = newstring.replace("W","087")
   newstring = newstring.replace("x","120")
   newstring = newstring.replace("X","088")
   newstring = newstring.replace("y","121")
   newstring = newstring.replace("Y","089")
   newstring = newstring.replace("z","122")
   newstring = newstring.replace("Z","090")
   newstring = newstring.replace('/',' ')
   newstring = newstring.replace('"',' ')
   newstring = newstring.replace('.', ' ')
   newstring = newstring.replace(',',' ')
#    newstring = newstring.replace('\\n','')
   return newstring
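# Example (illustrative, not part of the original run):
#     cleanString('Ah, me.')  ->  '065104  109101 '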

f2 = open(r'C:\Users\rayst\Documents\525-DH\texts-for-analysis\ascii\output\Hawthorne-blithedale-romance--ascii-out-1.txt', "w") # open a new file to write to.
# Much of the code below is commented out and is only there for convenience.
# The following for loop runs the functions defined above on each line of the
# input file; the loop ends when the end of the file is reached.
for line in open(r'C:\Users\rayst\Documents\525-DH\texts-for-analysis\ascii\Hawthorne-blithedale-romance-ascii.txt', "r"):
# Uncomment the following four lines of code to remove everything from an
# asterisk, or from the word 'Lines', to the end of a line.
#    head, sep, tail = line.partition('*.')
#    line = head
#    head, sep, tail = line.partition('Lines')
#    line = head
# The following line removes nested brackets/parens within a line; repr()
# escapes the string, which is why stray quotes and backslashes are stripped below.
    line = repr(remove_text_inside_brackets(line))
# The commented line below offers a regex alternative to the loop above.
#   line = remove_bracketed_text_by_regex(line)
    line = repr(cleanString(line)) # this calls the cleanString function above for each line
    line = line.replace("'"," ") # this gets rid of any remaining apostrophes
    line = line.replace('\\n',' ') # this gets rid of the newline escapes added by repr()
    line = line.replace('\\',' ')
    line = line.replace("\""," ") # this gets rid of any remaining quotation marks
    line = line.replace('!',' ')
    line = line.replace('?',' ')
    line = line.replace('-',' ')
    line = line.replace(';',' ')
    line = remove_bracketed_text_by_regex(line)
    print(line) # this prints each line to the console for monitoring
    f2.write(line) # this writes the line to the cleaned output file
#    f2.write("\n") # this would append a newline after each line

f2.close() # this closes the output file
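For what it's worth, the long chain of replace() calls in cleanString can be collapsed into a single translation table. Here is a sketch of the prettier version (not the script I actually ran):

import string

# map every ASCII letter to its zero-padded three-digit decimal code in one pass
ASCII_TABLE = str.maketrans({c: "%03d" % ord(c) for c in string.ascii_letters})

print("OLD MOODIE".translate(ASCII_TABLE))  # 079076068 077079079068073069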
What I'm getting at is the bias in Topic Modeling and other Digital Humanities research. I'd like to develop a blind experiment to establish the extent to which Digital Humanities methods are objective. I've been thinking about this since objectivity came up in our WE1S Summer Research Camp discussions. I'm thinking about how to show with hard evidence that DH is a science: a mathematician, or a geneticist, and a digital humanist arriving at the same technical observations of the outputs, the results produced by the DH science. Perhaps this has all been done before, but in any event, I proved to myself with a fun little experiment that the computer doesn't make choices based on what the tokens are. It processes the relationships (maybe not the order, but probably the quantities) of the tokens to one another. The experiment demonstrates that this is so without appealing to the underlying mathematics of hierarchical clustering.
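Here is a minimal sketch of that point, assuming a bag-of-words representation like the one Lexos builds: a one-to-one renaming of the tokens leaves the token counts, and therefore any distance computed from them, unchanged.

from collections import Counter

text = "the cat saw the dog"
# rename every token by swapping each letter for its three-digit ASCII code
encoded = " ".join("".join("%03d" % ord(c) for c in w) for w in text.split())

# the count profiles are identical up to the renaming of the vocabulary
print(sorted(Counter(text.split()).values()))     # [1, 1, 1, 2]
print(sorted(Counter(encoded.split()).values()))  # [1, 1, 1, 2]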
Does hierarchical clustering produce different results based on the order of the tokens? If I were a mathematician, I might settle the question one way or the other with a proof. But this test helps me to visualize what is taking place. The numbers refer to the dendrogram and the word file just as well as any equivalent token set would, and it doesn't matter whether the equivalent token file is made from a one-to-one exchange of hexadecimal, binary, or pictograph equivalents. I guess what is needed next is something like a diagram that expresses the flow of meaning from our semiotic registers into the ones and zeros and then back. What is taking place? Yes, an analysis, but where does that analysis lose those to whom we are advocating for DH? Our entire world is being digitized, even our brains. What are we losing in the process, and how can the Digital Humanities help lessen the loss?
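One way to visualize this numerically (a sketch that assumes numpy and scipy are installed; it is not the computation Lexos itself performs): renaming the tokens only permutes the columns of the document-term matrix, which leaves the pairwise distances between documents, and hence the dendrogram, untouched.

import numpy as np
from scipy.cluster.hierarchy import linkage

# toy document-term counts: three segments (rows) over four token types
counts = np.array([[2, 1, 0, 1],
                   [2, 1, 1, 0],
                   [0, 0, 3, 2]])

# a one-to-one renaming of the tokens only reorders the columns
renamed = counts[:, [3, 0, 2, 1]]

# the linkage matrices, and so the dendrograms, are identical
print(np.allclose(linkage(counts, method='average'),
                  linkage(renamed, method='average')))  # True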
I then applied the same text-to-ASCII test to the topic modeling tool that we downloaded, using an online ASCII-to-text converter to decode the results. Topic modeling the ASCII-coded file of The Blithedale Romance, then converting Topic 0 back to text, gave me the following words: window drawing room hotel curtain front windows city extended dove steps All area places houses curtains boarding cat return doors
When I ran the topic modeling tool on the text version of the file, Topic 0 came out as the following: boat wrong river effort borne act bent coffin drift shore shoe emotion sobs dry yonder methinks base tragedy betwixt tuft
Everything else was the same, so most likely the sampler's seed started at a different token; I need to check into that. (A stopword list that strips common words from the text file but not from their numeric equivalents could also account for part of the difference.) Otherwise, I'll have to investigate why topic modeling the ASCII-encoded equivalent of a text file produces different results.
Continuing on, I ran the topic model again with the same exact set of text files (I used Lexos to cut the files above into 9 segments and then downloaded them. Note that downloading the segmented files from the Lexos || Prepare || Cut menu did not seem to work for me, so I had to download the cut files from the Manage menu), but this time I got the following list of words for Topic 0: fauntleroy chamber wealth splendor daughter corner glass saloon drink moodie governor liquor wife condition supposed beautiful cocktails gin feeble message
This indicates to me that the topic modeling tool uses a different seed token on each run to generate Topic 0. Because a different Topic 0 is produced whenever a different seed token is used, I have to question the validity of the topic modeling tool, or at least look into what others say about this issue. If topics relate to themes through their relative coherence, then what I would like to see is an average Topic 0 built from the sums of analyses run with all possible seed tokens. I'd rather concentrate on the other topic modeling variables than leave such an open variable in my research.
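In the meantime, fixing the random seed is the usual way to make topics reproducible from run to run. Here is a hedged sketch using gensim (the tool I used is a MALLET-based GUI, which I believe exposes a similar --random-seed option, so this only illustrates the general point):

from gensim import corpora
from gensim.models import LdaModel

docs = [["boat", "river", "shore"], ["window", "curtain", "hotel"],
        ["boat", "shore", "drift"], ["window", "city", "doors"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# the same random_state yields the same Topic 0 on every run
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42)
print(lda.show_topic(0))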
The above tests helped me realize that:
  1. the meanings of the words are equivalencies of human awareness that the world and life exist.
  2. the meaning of the words of a novel written more than 160 years ago remains unchanged even when converted to another code. Human-readable topics emerged once I converted the topic modeling results back from their ASCII codes to text.
  3. the meaning of the words may be analyzed by processes outside the scope of the analyst's knowledge.
  4. semantic relationships between words result from computer processes. The topics did seem to have a bit of coherence. In other topic models with robust data sets, topics have surprisingly uncanny coherence.
  5. the code being processed by another auxiliary code is the artificial processing of the awareness that the world and life exist.

