
Digital Humanities Final Paper 2018

13th Dec 2018
Reddit
By Ray Steding

The 525DH WE1S team began by aligning itself with the project goals of the UCSB Students and the Humanities Team as set out by Abigail Droge. Her “Team Overview” states: “The goal of this team is to understand the interactions between US college students and the humanities. We will approach the question from multiple perspectives, including those of educators, students, and institutions.” The focus of the project is to contextualize the discourse that students have about the humanities, though Abigail’s project extends beyond this in both methodology and hypothesis. With an understanding of her research goals, and through the semester-long communication between the UCSB "Students and the Humanities Team" and the "525DH WE1S" team, our team set out to gather meaningful data to supplement and to model.

Joyce Brummet became our WE1S project leader during one of our in-class meetings, in which we decided to separate into two data-inquiry groups: the Twitter team and the Reddit team. Since I have experience with gathering data, I assisted when called on, focusing on how to obtain, process, and interpret the Reddit data.

Initially, I found a large Reddit file through Internet searches, but it turned out not to contain the Reddit comments of interest. Following a Google thread, I discovered that Google offers a product known as Google BigQuery, along with a $300 free credit for anyone to begin querying the datasets it hosts. Through BigQuery I obtained all of the available, and for my purposes usable, Reddit comments from the subreddits of the community colleges, the California State colleges, the UC system colleges, and the Ivy League colleges. Additionally, I downloaded comments from the askReddit, education, explainlikeimfive, politics, soccer, applyingtocollege, and LeagueOfLegends subreddits.

With that data available to me, I began writing the software to convert it into JSON files that the UCSB “harbor” server notebooks would accept, and finally I produced the Reddit topic models. To recap, this is the process that I used:
  • Query Google’s database and download the results as source files in JSON format.
  • Grep the source files for the search terms humanities, liberal arts, the arts, and STEM.
  • Process the resultant files through Python code so that they conform to the WE1S Workflow Management System.
  • Upload the files to a harbor server data directory as zipped files.
  • Run the Jupyter notebooks on the server to process the files into topic models aligned with the goals of the project.
The Google BigQuery SQL command that I used follows:
  SELECT author, body
  FROM `fh-bigquery.reddit_comments.2018*`
  WHERE subreddit = 'UCLA'
  AND REGEXP_CONTAINS(body, r'(?i)\bhumanities\b|\bliberal arts\b|\bthe arts\b')
  LIMIT 16000
The resultant JSON files from queries such as the one above were lists of all the comments in the acquired subreddits that contained the search terms humanities, liberal arts, the arts, and STEM.
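
The grep step in the list above can be pictured as a simple line filter. The following is only a minimal Python sketch of the same idea, assuming the BigQuery results are newline-delimited JSON files sitting in a hypothetical local folder:

import glob
import re

# Hypothetical folder holding the newline-delimited JSON results exported from BigQuery
SOURCE_GLOB = r'C:\bigquery-results\*.json'

# The four search terms used throughout this project
TERMS = re.compile(r'\bhumanities\b|\bliberal arts\b|\bthe arts\b|\bSTEM\b', re.IGNORECASE)

for source_path in glob.glob(SOURCE_GLOB):
    with open(source_path, encoding='utf-8') as source:
        for line in source:              # each line is one comment record
            if TERMS.search(line):
                print(line.strip())      # or write the matching line out to a per-term file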

To get all of the BigQuery files into a format that the UCSB Jupyter notebooks accepted, I had to write Python code. I wrote the programs as separate cells in Jupyter notebooks that I ran locally; the notebooks are reproduced at the end of this post.

After processing the files into the accepted WE1S format, I zipped them and uploaded them to the /data/data-scraped/CSUN-Reddit-Data subfolder on the UCSB server here. I often altered the code depending on what needed to be accomplished. For instance, I added the TextBlob sentiment values to the bibliography of the DFR Browser by including the value in the schema’s title field.
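
For reference, each processed comment becomes a small JSON document built from the WE1S schema fields assembled in the code at the end of this post. The following is a minimal sketch of one such document; the file name, date, and content values are placeholders:

import json

# Placeholder values; the real documents are built by the notebooks reproduced below.
doc = {
    "doc_id": "humanities_0.250_1_UCLA-2018-0.json",
    "attachment_id": "none",
    "pub": "Reddit",
    "pub_date": "2018",
    "length": "1500",
    "title": "humanities_0.250_1_UCLA-2018-0.json",   # the sentiment value rides along in the title
    "name": "humanities_0.250_1_UCLA-2018-0.json",
    "url": "Google BigQuery",
    "namespace": "we1sv2.0",
    "metapath": "Corpus,UCLA-2018-0.json,RawData",
    "content": "the cleaned comment text goes here",
}
print(json.dumps(doc, indent=2))

The sentiment value embedded in the title field is what surfaces in the DFR Browser bibliography.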

All of the topic models produced are in a Google spreadsheet here.

The topic models within the spreadsheet have preliminary notes with a brief description of what I hoped to accomplish by producing each model. All of the models presented different themes (topics) that ran through the corpora listed in the file list of the first Jupyter notebook, “import topic model data.” I always looked at the first five topics in each model and read through the first five comments in each of those topics. I also looked at the topics that had the highest representation within the model.

While I was making the first of the topic models with the BigQuery subreddit data, a torrent that I had requested about a month earlier, containing all of the 2017 Reddit comments, finished downloading onto my computer. The new 2017 data, some 122 GB of comments in three JSON-list-formatted files, presented me with the opportunity to model an entire year's worth of Reddit comments: I could analyze what every 2017 Reddit commenter said about the humanities. So I did, but I first had to get the data from the source files into a format acceptable to the UCSB Jupyter notebooks. The two programs that I wrote were slight adaptations of the first code.

The coding of the last two programs shows improvement over the first two processing programs, and the pairs may be compared at the end of this post. The longer I used Google searches and Stack Overflow, the easier it became to get the code to do what I wanted and needed it to do. Working with files as lists of elements reminded me of working with arrays in other languages, but it is not an intuitive process, and many of the examples on Stack Overflow and other sites did not include a solution to what I was looking for.

After clearing up the coding issues, I then had to deal with a pesky problem in the UCSB notebooks. I sometimes received the following error:

Error in load_from_mallet_state(mallet_state_file = paste0(modelDir, "/", \: hyperparameter notation missing. Is this really a file produced by mallet train-topics --output-state?

I still don’t know whether the problem lies within one of the many thousands of source files or within cell 7 (“topics”) of notebook 4, “make topic model.” Sometimes the files would go through, and other times the same files would not. Sometimes creating a whole new project worked, and other times it did not. I always ran the three Jupyter notebooks in sequence, but that didn’t help.
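
Because the error comes from reading the MALLET state file, one quick check would be whether the gzipped file written by mallet train-topics --output-state still contains its hyperparameter header lines near the top. A minimal sketch, assuming a hypothetical topic-state.gz in the current directory:

import gzip
from itertools import islice

# Hypothetical path to the state file written by `mallet train-topics --output-state`
state_path = 'topic-state.gz'

# Read only the first few header lines of the (potentially huge) gzipped state file
with gzip.open(state_path, 'rt', encoding='utf-8') as state:
    header = list(islice(state, 5))

for line in header:
    print(line.rstrip())

# A well-formed state file normally includes '#alpha :' and '#beta :' lines near the top;
# their absence is what the "hyperparameter notation missing" error complains about.
if not any(line.startswith('#alpha') for line in header):
    print('No #alpha line found; the state file may be truncated or not a MALLET state file.')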

Despite the difficulties, the research models gave me insight into several topic-modeling questions, such as: does a balanced model, composed of equal file sizes from different subreddits, result in a more homogeneous mix of files within each topic? And does representativeness become more apparent when both file size and distribution across subreddits are balanced within a corpus? The answers to both questions are yes. Another question that arose from looking at the topic models was why comments from specific subreddits align as the majority of files under specific topics; the files representing a topic were often all from one subreddit. This tendency became greatly exaggerated in the models whose corpora mixed University Wire articles and subreddit comments: files tended to constellate according to whether they were University Wire or subreddit material, even when their file sizes were equal. In any event, it also became apparent that although articles retrieved with different search terms may be mixed within a topic model, different genres of corpora may not. Thus Twitter, for the most part, should be its own genre and not mixed with University Wire or subreddits; genres should be analyzed independently of each other.

Persevering past the point of doubt

Although I’m still in the early stages of working with the models, they have given me first-hand experience with the terrain of discourse used by different sectors of commenters. My interest is in the discourse of two opposing groups of students: humanities students and STEM students. First, I want to know their reasons for choosing their majors and, for postgraduate students, how they feel about the results as they seek employment. Second, I want the sentiment and subjectivity values to help analyze the feeling tone of student discourse. A noticeable type of discourse seems almost unique to each subreddit, and this may be part of the reason specific subreddits align under specific topics; moderators also keep the commenters in line. Since I’m most interested in why students believe few job opportunities exist for humanities majors, I became excited when I found a topic consisting mostly of comments about a crisis in STEM caused by a surplus of STEM-field workers.
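
For reference, the sentiment and subjectivity values mentioned above come straight from TextBlob, which reports a polarity score (-1.0 negative to 1.0 positive) and a subjectivity score (0.0 objective to 1.0 subjective) for any piece of text. A minimal sketch with a made-up comment:

from textblob import TextBlob

# A made-up comment used only to illustrate the two scores
comment = "i love studying the humanities but i worry there are no jobs waiting for me"

blob = TextBlob(comment)
print(round(blob.sentiment.polarity, 3))      # feeling tone: -1.0 (negative) to 1.0 (positive)
print(round(blob.sentiment.subjectivity, 3))  # 0.0 (objective) to 1.0 (subjective)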

That discovery led me to the last two models created this semester. Topic model M and topic model N have corpora composed of comments containing the humanities search terms together with the word crisis, and the search term STEM together with the word crisis, respectively. What becomes immediately apparent when looking at these models is that the comments are unusually long. It seems as though the word crisis comes with an earnest explanation; then again, two dissimilar words are simply less likely to co-occur in shorter documents than within a larger context. I’ve read many of the crisis comments but am not through with them, so I don’t have any conclusions at this point.
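
The selection behind those two corpora amounts to a co-occurrence filter over the cleaned comments. The sketch below is illustrative rather than the exact code I ran; it assumes one cleaned, lower-cased comment per line in a hypothetical input file, and it also reports the average comment length, which is what stands out in models M and N:

import re

# Illustrative input: one cleaned, lower-cased comment per line (hypothetical file name)
SOURCE = 'the_arts3.txt'

HUMANITIES = re.compile(r'\bhumanities\b|\bliberal arts\b|\bthe arts\b')
CRISIS = re.compile(r'\bcrisis\b')

matches = []
with open(SOURCE, encoding='utf-8') as f:
    for comment in f:
        # Keep only comments where a humanities term and "crisis" co-occur
        if HUMANITIES.search(comment) and CRISIS.search(comment):
            matches.append(comment)

print(len(matches), 'comments matched')
if matches:
    average_words = sum(len(c.split()) for c in matches) / len(matches)
    print('average length:', round(average_words, 1), 'words')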

When I began work on the What Every 1 Says project during Summer Camp, I thought about what I would like to topic model. What first came to mind were the crises that lead one to make difficult decisions in life, such that one becomes more than what one is at the time: something like those moments of indecision where there are no good answers except to proceed in one doubtful way or another. On both an individual and a collective level, crisis leads to something greater. Since I’m a runner, I also liked the idea of somehow modeling the motivation that pushes one past the point of absolute doubt and on to championship in whatever field or discipline. So to me, the question of a crisis in the humanities defines the conditions representative of the barrier of doubt that must be overcome. What is present in the mind of a researcher, engineer, scholar, or professional of any field that gives them the perseverance to push past doubt about their hypothesis? What is the discourse of students as they push past their parents' ideals? I presume, in the context of students, that the barrier I’ve been reading about is the belief that they will have no future should they become humanities majors. As one commenter bluntly states: "here's my story I graduated college almost years ago with a useless degree in humanities it took me more than a year to find a full-time job that paid less than dollar k prior to finding a full-time job." But maybe that condition is a subset within the larger meaning of crisis: the crisis of being human; the human condition; the human condition that perseverance changes.

Researching two different forms of a crisis in the humanities

This is my personal narrative from a few days ago that relates to choosing what to research.

I woke up from a dream about how to add more material to my topic model analysis. I remembered from the dream a looping of information from the liberal arts data files back into the topic model source material, and I watched as each iteration changed the model. It must have come from what I was reading on my screen not long before I went to bed. To be clear, before going to bed that night I had talked with my roommate about a personal interest of mine: topic modeling research that might analyze the way people face the difficult choices in life, the choices that cause them to become more than who they are: the fulfillment of their potentialities. Well, on the screen when I woke up was, by coincidence, the following Reddit comment, which is a short story:

i ignored the doorbell as it rang for the seventeenth time in the past four minutes i had work to do the study of neuroparasitology was not to be halted for girl scout
cookies i used amazon like a normal person besides i was pretty sure i was close to discovering a creature at least partially responsible for aberrant behavior in dogs
then i heard my door open the telephone rang again but i left it i hadn`t slept in hours and whoever kept calling could wait until my work was complete a moment later
there was a cough from behind me i waved my hand absently it was probably gary with some insignificant question about how a section of the brain functioned during
mating and how that would affect a mite hidden nearby honestly i had no idea how he got his phd i finished making my notes and turned around i had finished my pre
lecture sigh during the chair spin and was preparing the usual diatribe when i noticed three large men in suits were there instead of my idiot colleague startled but not
unduly alarmed i wondered if they were investors sir youre going to have to come with us the higher ups want to talk to you i was now quite concerned that the private
company funding my important research was going to pull funds or had made a deal with some mafia types ridiculous but not impossible i frowned at him but nodded
assent as i went to put on some pants the benefits of working from home meant i usually didnt have to worry about such stupid social delicacies as pants but id make
an exception to keep my job i followed the bulky men noticing guns and an alertness that spoke to their quality i noticed that my door was still in tact as we left the
building and that there was a couple of guards waiting outside who locked up after me i was led from my small house to one of four large black suvs on the street i
spent the next thirty minutes thinking about where my quarry might be hiding it clearly wasnt a stem parasite but it affected behaviors influenced by both lobes i had
been confident where id been looking but it hadnt been there i made some notes on my lab coat notebook eventually i arrived at a large building it was a government
building of some kind it wasnt clear what it was for but it had all the look of petty bureaucracy it looked like my afternoon was going to be wasted on paperwork and
financial inquiries i was lead to an office on the top floor not a good sign and everyone in the building seemed to be straining to get a look at me i was glad id
remembered to put on pants sometimes i forgot and everyone got in a tizzy i was motioned into a room and saw a large table and an empty chair close to me on the
other side was a line of people seated comfortably with an extensive spread of documents in front of them i sat down and looked at the documents waiting for them to
finish their arbitrary introductions this didnt look to be about neuroparasitology at all i caught the last couple of names senators of some such we brought you here
because after an extensive and exhaustive search we have uncovered that you alphonse anderson are the most intelligent person on the planet we have arranged for
you solve all the big questions posed to our species about time you idiots have been bumbling about these questions for years all the infrastructure is there and no
one gets how to use it its all the bloody paperwork a man from the corner stepped out and placed stacks of folders down in front of me each listed with a major crisis
each of these folders tells you the resources at your disposal and the problem that needs to be solved she droned on some more but i wasnt listening i scanned the
first folder homelessness and tossed it towards them put the homeless in a warehouse and have them work on assembling care packages to send abroad i continued
food is easy its all there it just has to rounded up and distributed instead of being left to rot reduce the best before dates and ship them to impoverished nations for
single day items like hamburgers or whatever give them to local food banks and shelters after closing i went through the files easily putting matters to rest after i was
halfway through the stack i came to one that gave me some trouble the meaning of the universe i asked the stunned board in front of me well yes one of them
stammered back at me no scientific solution ask some philosopher about your drivel but its the most important question there one of the women shouted she was
minister of something or another probably something in the arts with an attitude like that all the people ive saved will have plenty of time to ponder that once they stop
dying to make a determination on that id need more information than humanity has at its disposal if you want my personal opinion i would say there is no purpose to
the universe no divine entity and no driving unseen force its a story of star dust biology and luck no sooner had i finished my sentence than everyone in the room
began to scream i sat back perplexed as they wailed with a kind of agony that i hadnt even imagined i could see fires start in the building across the street cars
suddenly swerve into each other in the roads below id seen the microphone from the start of course but i hadnt been aware that the world was listening live over the
next few weeks death tolls mounted my policies to reduce the suffering of mankind more for efficiencys sake than sympathy had ended suddenly when everyone
despaired in the face of a cruel reality it was only a matter of time until everything broke down still i was pretty happy with the results id been allowed to go back to my
research after all.
The story has elements of looping within it. The microphone left on broadcasts out to the masses, and the protagonist has likely become infected by the creepy parasites that control their hosts, the very parasites that are the subject of his research. The story is still open in another tab in my browser; it appeared within the “the arts” search-term results from grep-searching the 2017 Reddit files. It is an exhibit, so to speak, a real-life example, of the kind of research that I was talking about with my roommate. I did not read the whole story closely until I awoke. It’s an example of the “looping” of material back into the topic model results: in my case, it combined with my discussion with my roommate, formed into a dream, and resulted in a realization of what I had hoped to find with topic modeling at the beginning of Summer Camp. It has a fuzzy logic about it; it’s kind of spooky. The reason is that the last lines seem to explain what I’d been telling my roommate, what I’m interested in researching: “suddenly when everyone despaired in the face of a cruel reality it was only a matter of time until everything broke down still i was pretty happy with the results id been allowed to go back to my research” (anonymous, 2017, Reddit comment). The reality that breaks down and leads one to persevere and overcome is what I'm happy to research.

In conclusion, what a crisis in the humanities meant to me, as opposed to what a crisis in the humanities means to the WE1S project, materialized as a legitimate path of research distinct from the project goals of WE1S, yet not so different in topic modeling procedure. My path forward, like the protagonist's in the story above (“i'd been allowed to go back to my research”), is one way of symbolizing a resolution of a crisis in the humanities. The fictional story represents the efforts of a struggling author confronting and overcoming a personal crisis in the humanities: on the way to becoming more than a fantasy about potentialities, they are writing it out and affecting the world with their words.



The following two Python cells process the Google BigQuery JSON files.

2018-11-21-JSON-Metadata-searchterm-sentiment is located on the 525 DH WE1S Google Drive here.


'''2018-11-21-JSON-Metadata-searchterm-sentiment.ipynb  Author: raysteding@gmail.com
This file takes the .json Reddit list files from Google BigQuery and cleans them. Many things, such as the
exclusion of the comments with STEM in them, could have been handled better, but it was a fun learning experience.
'''
import json
import re
import pprint
import os
import glob
import codecs

# Set the path to the .json list files that you wish to process into .txt files.
# Why do this when they need to be .json? It is an option that I chose; you can change it.
# The code assumes that only the .json source files exist in the source directory, although subdirectories are ok.
path = r'C:\Users\rayst\Documents\525-DH\software\dev\Lexos-cleaned\selected_documents'

# Set the counters to zero
count = 0
cnt = 0
cnt2 = 0

# Process every .json source file from Google's BigQuery that is in the above source directory
for filename in glob.glob(os.path.join(path, '*.json')):
    filename2 = filename
    filepath2 = filename2.replace(".json", "-")
    fp = codecs.open(filename, 'r', 'UTF-8')
    filepath2 = filepath2 + str(cnt) + '.txt'
    outfile = open(filepath2, 'w', encoding='UTF-8')
    cnt += 1
    cnt2 = 0
    for line in fp:
        cnt2 += 1
        word = line
        word2 = line
        word3 = line
        word4 = line
        # First omit the comments about stem cells, because they most likely refer to biology
        # rather than to STEM vs. the humanities.
        if (word.find('stem cells') != -1) or (word2.find('Stem cells') != -1) \
                or (word3.find('stem cell') != -1) or (word4.find('Stem cell') != -1):
            continue
        # If the comment isn't about biology then process it. Note that each line is an entire comment.
        else:
            # Get rid of the username of the comment because it is not needed
            first, second = line.split('\"body\":', 1)
            line = second
            # Clean the line. Here you can add anything that you wish to get rid of or alter.
            line = line.replace('|', ' ')
            line = line.replace('\"', ' ') ; line = line.replace('(', ' ')
            line = line.replace(')', ' ') ; line = line.replace('}', ' ')
            line = line.replace('{', ' ') ; line = line.replace('/n/n', ' ')
            line = line.replace('/n', ' ') ; line = line.replace('{', ' ')
            line = line.replace('\\', ' ') ; line = line.replace('\n\n', ' ')
            line = line.replace(' n n', ' ') ; line = line.replace(' n ', ' ')
            line = line.replace('[', ' ') ; line = line.replace(']', ' ')
            line = line.replace('*', ' ') ; line = line.replace('&', ' ')
            line = line.replace('/', ' ') ; line = line.replace('\\', ' ')
            line = line.replace('\'', '') ; line = line.replace(',', ' ')
            line = line.replace('.', ' ') ; line = line.replace(':', ' ')
            line = line.replace('0', ' ') ; line = line.replace('1', ' ')
            line = line.replace('2', ' ') ; line = line.replace('3', ' ')
            line = line.replace('4', ' ') ; line = line.replace('5', ' ')
            line = line.replace('6', ' ') ; line = line.replace('7', ' ')
            line = line.replace('8', ' ') ; line = line.replace('9', ' ')
            line = line.replace('-', ' ') ; line = line.replace('$', 'dollar ')
            line = line.replace('%', 'percent ') ; line = line.replace(';', ' ')
            line = line.replace('_', ' ') ; line = line.replace('_____', ' ')
            line = line.replace('?', ' ') ; line = line.replace('!', ' ')
            line = line.replace('&', ' ')
            line = line.replace('deleted', '') ; line = line.replace('removed', '')
            line = line.replace('ufffd', '') ; line = re.sub(r'\s+', ' ', line)
            # Change all the text to lower case
            line = line.lower()
            # This newline is needed to enable the search terms to be searched for and written out to a separate file
            line = line + '\n'
            # print(line)
            outfile.write(line)
            # print(count)
            count += 1
            # Start a new output file after every 50 comments
            if count == 50:
                outfile.close()
                filepath2 = filename2.replace(".json", "-")
                filename3 = filepath2 + str(cnt2) + '.txt'
                outfile = open(filename3, 'w', encoding='UTF-8')
                count = 0
                cnt2 += 1
    outfile.close()
    fp.close()
# filepath2.close()
# It might be a good idea to also convert HTML entities such as "&amp;" to "&" and "&gt;" to ">"

'''This file takes the output txt files from the above program and separates comments out into four sets of files,
each with one of the key search terms as a condition for being written.
The file-name pattern is search term, sentiment value, filename.
'''
from textblob import TextBlob
import json
import re
import pprint
import os
import glob
import codecs

text = ''''''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #  'ultimate movie monster',
                    #  'amoeba-like mass', ...])

# for sentence in blob.sentences:
#     print(sentence.sentiment.polarity)
# 0.060
# -0.341

# blob.translate(to="es")   # 'La amenaza titular de The Blob...'

# Set the path to the .txt files (produced by the cell above) that you wish to process into .json files.
# The code assumes that only those files exist in the source directory, although subdirectories are ok.
path = r'C:\Users\rayst\Documents\525-DH\software\dev\Lexos-cleaned\selected_documents'

# Set the counters to zero
count = 0
cnt = 0
cnt2 = 0
stem_cnt = 0
humanities_cnt = 0
liberal_arts_cnt = 0
the_arts_cnt = 0
# wiki = TextBlob("Python is a high-level, general-purpose programming language.")

# Process every .txt file from the cell above that is in the source directory
for filename in glob.glob(os.path.join(path, '*.txt')):
    fp = codecs.open(filename, 'r', 'UTF-8')
    # filepath2 = filepath2 + str(cnt) + '.txt'
    # outfile = open(filepath2, 'w', encoding='UTF-8')
    cnt += 1
    cnt2 = 0
    for line in fp:
        cnt2 += 1
        word = line
        word2 = line
        word3 = line
        word4 = line
        result = word.find('humanities')
        result2 = word2.find('liberal arts')
        result3 = word3.find('the arts')
        result4 = word4.find('stem')

        # process the humanities comments
        if (word.find('humanities') != -1):
            humanities_cnt += 1
            blob = TextBlob(word)
            print(line, '\n')
            for sentence in blob.sentences:
                sent = ("%.3f" % round(sentence.sentiment.polarity, 3))
                sent = str(sent)
                sent = sent.replace('-', 'neg-')
                print(sent, '\n')
            # Capture the path into a list: list1[9] is the file name
            list1 = (filename.split('\\'))
            # list2 becomes the date
            list2 = (list1[9].split('-'))
            newfname = path + '\\' + 'humanities_' + sent + '_' + str(humanities_cnt) + '_' + list1[9]
            newfname = newfname.replace('.txt', '.json')
            document = 'humanities_' + sent + '_' + str(humanities_cnt) + '_' + list1[9]
            document = document.replace('.txt', '.json')
            print(newfname)
            outfile = open(newfname, 'w', encoding='UTF-8')
            list3 = ('{"doc_id": ' + '"' + document + '",' + '"attachment_id\": \"none\",\"pub\": \"Reddit\",\"pub_date\": \"'
                     + list2[1] + '\",\"length\": \"1500\",\"title\": \"' + document + '",' + '"name\": \"'
                     + document + '",' + '\"url": \"Google BigQuery\",\"namespace\": \"we1sv2.0\",\"metapath\": \"Corpus,'
                     + list1[9] + ',' + 'RawData\",\"content\": \"' + line)
            list3 = list3.splitlines()[0]
            list3 = (list3 + '\"}')
            outfile.write(list3)
            outfile.close()

        # process the liberal arts comments
        if (word2.find('liberal arts') != -1):
            liberal_arts_cnt += 1
            blob = TextBlob(word)
            print(line, '\n')
            for sentence in blob.sentences:
                sent = ("%.3f" % round(sentence.sentiment.polarity, 3))
                sent = str(sent)
                sent = sent.replace('-', 'neg-')
                print(sent, '\n')
            # Capture the path into a list: list1[9] is the file name
            list1 = (filename.split('\\'))
            # list2 becomes the date
            list2 = (list1[9].split('-'))
            newfname = path + '\\' + 'liberal_arts_' + sent + '_' + str(liberal_arts_cnt) + '_' + list1[9]
            newfname = newfname.replace('.txt', '.json')
            document = 'liberal_arts_' + sent + '_' + str(liberal_arts_cnt) + '_' + list1[9]
            document = document.replace('.txt', '.json')
            print(newfname)
            outfile = open(newfname, 'w', encoding='UTF-8')
            list3 = ('{"doc_id": ' + '"' + document + '",' + '"attachment_id\": \"none\",\"pub\": \"Reddit\",\"pub_date\": \"'
                     + list2[1] + '\",\"length\": \"1500\",\"title\": \"' + document + '",' + '"name\": \"'
                     + document + '",' + '\"url": \"Google BigQuery\",\"namespace\": \"we1sv2.0\",\"metapath\": \"Corpus,'
                     + list1[9] + ',' + 'RawData\",\"content\": \"' + line)
            list3 = list3.splitlines()[0]
            list3 = (list3 + '\"}')
            outfile.write(list3)
            outfile.close()

        # process the the arts comments
        if (word3.find('the arts') != -1):
            the_arts_cnt += 1
            blob = TextBlob(word)
            print(line, '\n')
            for sentence in blob.sentences:
                sent = ("%.3f" % round(sentence.sentiment.polarity, 3))
                sent = str(sent)
                sent = sent.replace('-', 'neg-')
                print(sent, '\n')
            # Capture the path into a list: list1[9] is the file name
            list1 = (filename.split('\\'))
            # list2 becomes the date
            list2 = (list1[9].split('-'))
            newfname = path + '\\' + 'the_arts_' + sent + '_' + str(the_arts_cnt) + '_' + list1[9]
            newfname = newfname.replace('.txt', '.json')
            document = 'the_arts_' + sent + '_' + str(the_arts_cnt) + '_' + list1[9]
            document = document.replace('.txt', '.json')
            print(newfname)
            outfile = open(newfname, 'w', encoding='UTF-8')
            list3 = ('{"doc_id": ' + '"' + document + '",' + '"attachment_id\": \"none\",\"pub\": \"Reddit\",\"pub_date\": \"'
                     + list2[1] + '\",\"length\": \"1500\",\"title\": \"' + document + '",' + '"name\": \"'
                     + document + '",' + '\"url": \"Google BigQuery\",\"namespace\": \"we1sv2.0\",\"metapath\": \"Corpus,'
                     + list1[9] + ',' + 'RawData\",\"content\": \"' + line)
            list3 = list3.splitlines()[0]
            list3 = (list3 + '\"}')
            outfile.write(list3)
            outfile.close()

        # process the stem comments
        if (word4.find('stem') != -1):
            stem_cnt += 1
            blob = TextBlob(word)
            print(line, '\n')
            for sentence in blob.sentences:
                sent = ("%.3f" % round(sentence.sentiment.polarity, 3))
                sent = str(sent)
                sent = sent.replace('-', 'neg-')
                print(sent, '\n')
            # Capture the path into a list: list1[9] is the file name
            list1 = (filename.split('\\'))
            # list2 becomes the date
            list2 = (list1[9].split('-'))
            newfname = path + '\\' + 'stem_' + sent + '_' + str(stem_cnt) + '_' + list1[9]
            newfname = newfname.replace('.txt', '.json')
            document = 'stem_' + sent + '_' + str(stem_cnt) + '_' + list1[9]
            document = document.replace('.txt', '.json')
            print(newfname)
            outfile = open(newfname, 'w', encoding='UTF-8')
            list3 = ('{"doc_id": ' + '"' + document + '",' + '"attachment_id\": \"none\",\"pub\": \"Reddit\",\"pub_date\": \"'
                     + list2[1] + '\",\"length\": \"1500\",\"title\": \"' + document + '",' + '"name\": \"'
                     + document + '",' + '\"url": \"Google BigQuery\",\"namespace\": \"we1sv2.0\",\"metapath\": \"Corpus,'
                     + list1[9] + ',' + 'RawData\",\"content\": \"' + line)
            list3 = list3.splitlines()[0]
            list3 = (list3 + '\"}')
            outfile.write(list3)
            outfile.close()
    fp.close()
The following two Python cells process the three 2017 Reddit torrent files.

2017-Reddit-STEM-File-Processing-for-UCSB-Harbor is located on the 525DH WE1S Google Drive here.


'''2017-Reddit-STEM-File-Processing-for-UCSB-Harbor.ipynb  Author: raysteding@gmail.com
This file adds a comma to the end of each line in the files grep'ped from the 40GB source files for the search term.
For simplicity I add the commas to all lines and then hand-edit the last comma out and add the opening and closing
brackets at the beginning and end of the file to make a json-loadable file for further processing in the cell below,
which takes the json file and exports the Reddit comments.
'''
import json
import re
import os
import glob

path = r'C:\Users\rayst\Documents\525-DH\software\dev\Lexos-cleaned\Humanities-Crisis\the_arts3-crisis.json'
outfilepath = r'C:\Users\rayst\Documents\525-DH\software\dev\Lexos-cleaned\Humanities-Crisis\the_arts3-crisis-edited.json'

outfile = open(outfilepath, 'w', encoding='UTF-8')
with open(path, 'r', encoding='UTF-8') as f:
    for line in f:
        line = line.replace('}', '},')
        # Strip links from comments
        line = re.sub(r'^https?:\/\/.*[\r\n]*', '', line, flags=re.MULTILINE)
        print(line)
        outfile.write(line)
outfile.close()

'''This file processes the individual search terms from the complete 2017 Reddit data into .json files.
It requires that the source file from above is in proper json format. It then opens that file and captures the "body"
and the subreddit that each comment comes from, as well as the file name. It writes that data out as individual
comments, along with the sentiment and subjectivity values produced by textblob, into properly formatted json files
which include the proper metadata according to the WE1S schema.'''
import json
import os
import re
from textblob import TextBlob
'''textblob is a library that returns sentiment and subjectivity values. It has to be added to your system. The only
thing that needs to be edited in this file is the path below, which points to the file produced in the cell above.
'''
path = r'C:\Users\rayst\Documents\525-DH\software\dev\Lexos-cleaned\Humanities-Crisis\the_arts3-crisis-edited.json'
file_cnt = 0
with open(path, 'r') as f:
    num_lines = sum(1 for line in open(path))
    doc = json.loads(f.read())
    count = 0
    sent_total = 0.000
    while count < len(doc):
        # Capture the comment body and the subreddit it comes from
        line = doc[count]['body']
        subreddit = doc[count]['subreddit']
        # Clean the comment text (the same kind of character clean-up as in the first cell)
        line = line.replace(',', ' ')
        line = line.replace('deleted', '') ; line = line.replace('removed', '')
        line = line.replace('ufffd', '') ; line = re.sub(r'\s+', ' ', line)
        # Change all the text to lower case
        line = line.lower()
        line = line.rstrip('\r\n')
        content = line

        '''Calculate the sentiment polarity'''
        blob = TextBlob(content)
        sent_total = 0.000
        sent_cnt = 1
        for sentence in blob.sentences:
            sent = ("%.3f" % round(sentence.sentiment.polarity, 3))
            sent = float(sent)
            sent_total = sent_total + sent
            sent_cnt += 1
        sent = sent_total / sent_cnt
        sent = round(sent, 3)
        SP = sent
        sent = 0.000
        sent_total = 0.000

        '''Calculate the subjectivity value'''
        sent_total = 0.000
        sent_cnt = 1
        for sentence in blob.sentences:
            sent = ("%.3f" % round(sentence.sentiment.subjectivity, 3))
            sent = float(sent)
            sent_total = sent_total + sent
            sent_cnt += 1
        sent = sent_total / sent_cnt
        sent = round(sent, 3)
        SS = sent
        sent = 0.000
        sent_total = 0.000

        '''This is the beginning of assigning the WE1S schema variables'''
        # Create an output filename
        filename2 = f.name
        filename2 = filename2.replace('-edited', '')
        filename2 = filename2.replace('.json', '')
        filename2 = filename2 + '_' + str(file_cnt) + '.json'
        file_title = (os.path.basename(filename2))
        date = '2017'
        # Capture the word length as a string
        word_lengths = len(content.split())
        word_lengths = str(word_lengths)
        # Create a single variable with the subreddit, sentiment, subjectivity, word length, and file title
        document = str(subreddit) + '_' + 'sentiment' + '_' + str(SP) + '_' + 'subjectivity' + '_' + str(SS) \
                   + '_' + 'words=' + word_lengths + '_' + file_title
        # Put all the data required by the WE1S schema into a variable for the output file
        list3 = (
            '{' + '\n'
            + ' \"doc_id\": \"' + file_title + '\",' + '\n'
            + ' \"attachment_id\": ' + '\"none' + '\",' + '\n'
            + ' \"pub\": ' + '\"subReddits' + '\",' + '\n'
            + ' \"pub_date\": \"' + date + '\",' + '\n'
            + ' \"length\": \"' + word_lengths + '\",' + '\n'
            + ' \"title\": \"' + document + '\",' + '\n'
            + ' \"content\": \"' + content + '\",' + '\n'
            + ' \"name\": \"' + document + '\",' + '\n'
            + ' \"namespace\": \"we1sv2.0' + '\",' + '\n'
            + ' \"metapath\": \"Corpus' + ',' + file_title + ',' + 'RawData\"' + '\n'
            + '}'
        )
        count += 1
        print(filename2)
        outfile = open(filename2, 'w', encoding='UTF-8')
        outfile.write(list3)
        file_cnt += 1
        outfile.close()
