Hello. As the title suggests, I am studying Twitter trends. I am a second-year student in the GMS (Global Media Studies) Faculty at Komazawa University. This article introduces the research, the code, and some references that may be useful.
Originally I was interested in Twitter trends, and since my first year I had been playing with Python, the Twitter API, and MeCab, but it was a primitive setup that morphologically analyzed tweets and counted individual words. I also experimented with language and location metadata and with which kanji appeared.
↓ Then, in an N-gram-like way, I recorded every sequence of 2 to 12 words and aggregated all of them: a simple trend analysis over surface expressions. As a side note, I wrote a lot of manual rules, much like Twitter's own trends, about where particles must not appear and how auxiliary verbs behave. This was the beginning of my second year.
↓ After that, when deciding what to research, I had the idea of defining and modeling the stationary trends that fluctuate over the course of a day, but I was unsure how to go about it, and a professor at another university pointed out that the scoring and the rules would then be in someone else's hands. That was around the summer of my second year.
↓ I wandered for a while, but eventually arrived at UniDic and the Word List by Semantic Principles (semantic classification) from the National Institute for Japanese Language and Linguistics (https://pj.ninjal.ac.jp/corpus_center/goihyo.html) and tried semantic normalization and clustering of words. It worked quite well, but because the resource is a correspondence table across multiple dictionaries, a reverse lookup of what should be homonyms can resolve to an extreme sense and drag in unrelated words, and I do not think that problem is solved. There is also the issue of words unknown to UniDic, which the trend extraction does not take into account. That was around October of my second year.
↓ Once semantic normalization worked, I brought in CaboCha, because from an earlier stage I had wanted to use dependency parsing to build trends from dependencies between meanings. This was a hassle: I could not get it running on the seminar server or on AWS, so I struggled to install it on my own Windows machine and drive it through subprocess (the Python binding is 32-bit only). Since UniDic and CaboCha segment text at different positions (can CaboCha also be given a custom dictionary?), I aligned the two segmentations at their shared boundaries, a bit like taking a least common multiple (sketched just below).
↓ As a result, I was able to build semantic trends based on dependencies between meanings. However, because of the homonym problem above (a reverse lookup can land on an extreme sense), I added a correction: campaign tweets of the "Win a prize! Follow and retweet" kind are an issue, and since the simple trend analysis already had a way of removing them by exploiting the duplication of surface trends, I decided to reuse that here as well.
↓ So far I can analyze both "meaning trends based on dependencies between meanings" and "simple expression trends". My current impression is that meaning trends, unlike expression trends, are hard to separate cleanly; they reflect something deeper, almost psychological, and making that concrete is an important part of this research. Future issues include aggregating the semantic trends that accompany an expression trend, estimating expression trends from meaning trends, and working out how far the two can be used for different purposes and how to combine them. It is now November of my second year and I am working hard toward my graduation research. I wrote this up because I had reached a good stopping point (maybe I should have written it earlier).
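The boundary alignment mentioned above can be sketched as follows. This is a simplified, self-contained version of the idea only (the actual implementation is the bunruikugiri function in imi_trend.py further down): group the morpheme segmentation so that its cumulative character counts line up with the bunsetsu segmentation.

```python
def align_segmentations(morpheme_lens, bunsetsu_lens):
    # Group the morpheme segmentation (UniDic) so that its cumulative
    # character counts line up with the bunsetsu segmentation (CaboCha).
    groups, current, targets = [], [], list(bunsetsu_lens)
    for n in morpheme_lens:
        if not targets:
            return None                      # segmentations cannot be aligned
        current.append(n)
        if sum(current) == targets[0]:       # boundary shared by both segmentations
            groups.append(current)
            current, targets = [], targets[1:]
        elif sum(current) > targets[0]:
            return None                      # boundaries cross; skip this tweet
    return groups if not targets and not current else None

# Morphemes of 2,1,3,1 characters vs. bunsetsu of 3 and 4 characters:
print(align_segmentations([2, 1, 3, 1], [3, 4]))  # -> [[2, 1], [3, 1]]
```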
At the moment, an expression trend is a list of character strings, while a meaning trend is the same kind of list with the strings replaced by their semantic categories.
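As a rough illustration of the two representations (the strings, counts, and category labels below are made up for the example, not output of the programs that follow):

```python
# Expression trend: surface strings paired with how often they appear.
expression_trend = [
    ["train delay", 41],
    ["train delays again", 23],
]

# Meaning trend: the same idea, but each word is replaced by its semantic
# category (Word List by Semantic Principles), and dependency-linked
# category pairs are what get counted.
meaning_trend = [
    [["transport", "stoppage"], 67],
    [["weather", "change"], 35],
]
```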
If you want to use this code, please let me know, even just on Twitter (https://twitter.com/kenkensz9); I worry about that sort of thing, so please treat the copyright as remaining with me. I am also posting this partly as a backup. The environment is Windows 10 64-bit with Python 3 (various versions).
First, the program for estimating expression trends. custam_freq_sentece.txt is the full text collected for parsing, custam_freq_tue.txt holds the trend candidates, custam_freq.txt holds the trends, and custam_freq_new.txt is where the program writes out the longest trends after removing overlaps. Trends are also rotated out as time passes; that is what the freshtime = int(time.time()*1000)-200000 part controls, and the value should be adjusted to how fast you are collecting tweets. There is also a word list, badword: if a tweet contains any of these words it is not processed at all. This may not be the latest version of the program, since my version management is poor, but please feel free to use it.
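Before the full script, here is a minimal standalone sketch of that rolling freshness window (the candidate list and its timestamps are made up; the script itself does this inside collect_count()):

```python
import collections
import time

WINDOW_MS = 200_000  # plays the same role as the 200000 in freshtime below

# Hypothetical candidates: [string, unix time in milliseconds] pairs.
all_candidates = [["example trend", int(time.time() * 1000)],
                  ["stale trend", int(time.time() * 1000) - 10 * WINDOW_MS]]

freshtime = int(time.time() * 1000) - WINDOW_MS
fresh = [text for text, t in all_candidates if int(t) > freshtime]  # drop stale candidates
print(collections.Counter(fresh).most_common())                     # frequency-ranked trends
```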
hyousyutu_trend.py
# coding: utf-8
import tweepy
import datetime
import re
import itertools
import collections
from pytz import timezone
import time
import MeCab
#import threading
#from multiprocessing import Pool
import os
#import multiprocessing
import concurrent.futures
import urllib.parse
#import
import pdb  #for interactive debugging; call pdb.set_trace() where needed
import gc
import sys
import emoji
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
authapp = tweepy.AppAuthHandler(consumer_key,consumer_secret)
apiapp = tweepy.API(authapp)
#Authentication
m_owaka = MeCab.Tagger("-Owakati")
m_ocha = MeCab.Tagger("-Ochasen")
#mecab morpheme decomposition definition
lang_dict="{'en': 'English', 'und': 'unknown', 'is': 'Icelandic', 'ay': 'Aymara', 'ga': 'Irish', 'az': 'Azerbaigen', 'as': 'Assamese', 'aa': 'Afar', 'ab': 'Aphazian', 'af': 'Afrikaans', 'am': 'Amharic', 'ar': 'Arabic', 'sq': 'Albanian', 'hy': 'Armenian', 'it': 'Italian', 'yi': 'Yiddish', 'iu': 'Inuktitut', 'ik': 'Inupia', 'ia': 'Interlingua', 'ie': 'Interlingue', 'in': 'Indonesian', 'ug': 'Uyghur', 'cy': 'Welsh', 'vo': 'Volapuk', 'wo': 'Wolof', 'uk': 'Ukrainian', 'uz': 'Uzbek', 'ur': 'Urdu', 'et': 'Estonian', 'eo': 'Esperanto', 'or': 'Orian', 'oc': 'Okitan', 'nl': 'Dutch', 'om': 'Oromo', 'kk': 'Kazakh', 'ks': 'Kashmir', 'ca': 'Catalan', 'gl': 'Galician', 'ko': 'Korean', 'kn': 'Kannada', 'km': 'Cambodian', 'rw': 'Kyawanda', 'el': 'Greek language', 'ky': 'Kyrgyz', 'rn': 'Kirundi', 'gn': 'Guarani', 'qu': 'Quetua', 'gu': 'Gujarati', 'kl': 'Greenlandic', 'ku': 'Kurdish', 'ckb': '中央Kurdish', 'hr': 'Croatian', 'gd': 'Gaelic', 'gv': 'Gaelic', 'xh': 'Xhosa', 'co': 'Corsican', 'sm': 'Samoan', 'sg': 'Sangho', 'sa': 'Sanskrit', 'ss': 'Swati', 'jv': 'Javanese', 'ka': 'Georgian', 'sn': 'Shona', 'sd': 'Sindhi', 'si': 'Sinhala', 'sv': 'Swedish', 'su': 'Sudanese', 'zu': 'Zulu', 'es': 'Spanish', 'sk': 'Slovak', 'sl': 'Slovenian', 'sw': 'Swahili', 'tn': 'Setswana', 'st': 'Seto', 'sr': 'Serbian', 'sh': 'セルボCroatian', 'so': 'Somali', 'th': 'Thai', 'tl': 'Tagalog', 'tg': 'Tajik', 'tt': 'Tatar', 'ta': 'Tamil', 'cs': 'Czech language', 'ti': 'Tigrinya', 'bo': 'Tibetan', 'zh': 'Chinese', 'ts': 'Zonga', 'te': 'Telugu', 'da': 'Danish', 'de': 'German', 'tw': 'Twi', 'tk': 'Turkmen', 'tr': 'Turkish', 'to': 'Tongan', 'na': 'Nauruan', 'ja': 'Japanese', 'ne': 'Nepali', 'no': 'Norwegian', 'ht': 'Haitian', 'ha': 'Hausa', 'be': 'White Russian', 'ba': 'Bashkir', 'ps': 'Pasito', 'eu': 'Basque', 'hu': 'Hungarian', 'pa': 'Punjabi', 'bi': 'Bislama', 'bh': 'Bihari', 'my': 'Burmese', 'hi': 'Hindi', 'fj': 'Fijian', 'fi': 'Finnish', 'dz': 'Bhutanese', 'fo': 'Faroese', 'fr': 'French', 'fy': 'Frisian', 'bg': 'Bulgarian', 'br': 'Breton', 'vi': 'Vietnamese', 'iw': 'Hebrew', 'fa': 'Persian', 'bn': 'Bengali', 'pl': 'Polish language', 'pt': 'Portuguese', 'mi': 'Maori', 'mk': 'Macedonian', 'mg': 'Malagasy', 'mr': 'Malata', 'ml': 'Malayalam', 'mt': 'Maltese', 'ms': 'Malay', 'mo': 'Moldavian', 'mn': 'Mongolian', 'yo': 'Yoruba', 'lo': 'Laota', 'la': 'Latin', 'lv': 'Latvian', 'lt': 'Lithuanian', 'ln': 'Lingala', 'li': 'Limburgish', 'ro': 'Romanian', 'rm': 'Rate romance', 'ru': 'Russian'}"
lang_dict=eval(lang_dict)
lang_dict_inv = {v:k for k, v in lang_dict.items()}
#Language dictionary
all=[]
#List initialization
if os.path.exists('custam_freq_tue.txt'):
alll=open("custam_freq_tue.txt","r",encoding="utf-8-sig")
alll=alll.read()
all=eval(alll)
del alll
#all=[]
#Ready to export
#freq_write=open("custam_freq.txt","w",encoding="utf-8-sig")
sent_write=open("custam_freq_sentece.txt","a",encoding="utf-8-sig", errors='ignore')
#Ready to export
use_lang=["Japanese"]
use_type=["tweet"]
#config
uselang=""
for k in use_lang:
k_key=lang_dict_inv[k]
uselang=uselang+" lang:"+k_key
#config preparation
def inita(f,k):
suball=[]
small=[]
for s in k:
if not int(f)==int(s[1]):
#print("------",f)
suball.append(small)
small=[]
#print(s[0],s[1])
small.append(s)
f=s[1]
suball.append(small)
#If 2 is included
return suball
def notwo(al):
micro=[]
final=[]
kaburilist=[]
for fg in al:
kaburilist=[]
if len(fg)>1:
for v in itertools.combinations(fg, 2):
micro=[]
for s in v:
micro.append(s[0])
micro=sorted(micro,key=len,reverse=False)
kaburi=len(set(micro[0]) & set(micro[1]))
per=kaburi*100//len(micro[1])
#print(s[1],per,kaburi,len(micro[0]),len(micro[1]),"m",micro)
if per>50:
kaburilist.append(micro[0])
kaburilist.append(micro[1])
else:
final.append([micro[0],s[1]])
#print("fin1",micro[0],s[1])
if micro[0] in micro[1]:
pass
#print(micro[0],micro[1])
#print("included"*5)
#if micro[0] in kaburilist:
# kaburilist.remove(micro[0])
else:
pass
#print(fg[0][1],fg[0][0])
final.append([fg[0][0],fg[0][1]])
#print("fin3",fg[0][0],fg[0][1])
#if kaburilist:
#longword=max(kaburilist,key=len)
#final.append([longword,s[1]])
##print("fin2",longword,s[1])
#kaburilist.remove(longword)
#kaburilist=list(set(kaburilist))
#for k in kaburilist:
# if k in final:
# final.remove(k)
# #print("finremove1",k)
return final
def siage(fin):
fin=list(map(list, set(map(tuple, fin))))
finallen = sorted(fin, key=lambda x: len(x[0]))
finallendic=dict(finallen)
finalword=[]
for f in finallen:
finalword.append(f[0])
#print("f1",finalword)
notwo=[]
for v in itertools.combinations(finalword, 2):
#print(v)
if v[0] in v[1]:
#print("in")
if v[0] in finalword:
finalword.remove(v[0])
#print("f2",finalword)
finall=[]
for f in finalword:
finall.append([f,finallendic[f]])
finall = sorted(finall, key=lambda x: int(x[1]), reverse=True)
#print("final",finall)
kk=open("custam_freq_new.txt", 'w', errors='ignore')
kk.write(str(finall))
kk.close()
def eval_pattern(use_t):
tw=0
rp=0
rt=0
if "tweet" in use_t:
tw=1
if "retweet" in use_t:
rt=1
if "reply" in use_t:
rp=1
sword=""
if tw==1:
sword="filter:safe OR -filter:safe"
if rp==0:
sword=sword+" exclude:replies"
if rt==0:
sword=sword+" exclude:retweets"
elif tw==0:
if rp==1 and rt ==1:
sword="filter:reply OR filter:retweets"
elif rp==0 and rt ==0:
print("NO")
sys.exit()
elif rt==1:
sword="filter:retweets"
elif rp==1:
sword="filter:replies"
return sword
pat=eval_pattern(use_type)+" "+uselang
#config read function and execution
def a(n):
return n+1
def f(k):
k = list(map(a, k))
return k
def g(n,m):
b=[]
for _ in range(n):
m=f(m)
b.append(m)
return b
#Serial number list generation
def validate(text):
if re.search(r'(.)\1{1,}', text):
return False
elif re.search(r'(..)\1{1,}', text):
return False
elif re.search(r'(...)\1{1,}', text):
return False
elif re.search(r'(....)\1{1,}', text):
return False
elif re.search(r'(.....)\1{1,}', text):
return False
else:
return True
#Function to check for duplicates
def eval_what_nosp(c,i):
no_term=[]
no_start=[]
no_in=[]
koyu_meisi=[]
if re.findall(r"[「」、。)(『』&@_;【/<>,!】\/@]", c[0]):
no_term.append(i)
no_start.append(i)
no_in.append(i)
if len(c) == 4:
if "suffix" in c[3]:
no_start.append(i)
if "Proper noun" in c[3]:
koyu_meisi.append(i)
if c[3]=="noun-Non-independent-General":
no_term.append(i)
no_start.append(i)
no_in.append(i)
if "Particle" in c[3]:
no_term.append(i)
no_start.append(i)
#no_in.append(i)
if c[3]=="Particle-Attributive":
no_start.append(i)
if c[3]=="Particle":
no_start.append(i)
if "O" in c[2]:
if c[3]=="noun-Change connection":
no_term.append(i)
no_start.append(i)
no_in.append(i)
if len(c) == 6:
if c[4]=="Sahen Suru":
no_start.append(i)
if c[3]=="verb-Non-independent":
no_start.append(i)
if "suffix" in c[3]:
no_start.append(i)
if c[3]=="Auxiliary verb":
if c[2]=="Ta":
no_start.append(i)
no_in.append(i)
if c[3]=="Auxiliary verb":
if c[2]=="Absent":
no_start.append(i)
if c[3]=="Auxiliary verb":
if "Continuous use" in c[5]:
no_term.append(i)
no_start.append(i)
if c[2]=="To do":
if c[3]=="verb-Independence":
if c[5]=="Continuous form":
no_start.append(i)
no_in.append(i)
if c[2]=="Become":
if c[3]=="verb-Independence":
no_start.append(i)
no_in.append(i)
if c[2]=="Teru":
if c[3]=="verb-Non-independent":
no_start.append(i)
no_in.append(i)
if c[2]=="is":
if c[3]=="Auxiliary verb":
no_start.append(i)
no_in.append(i)
if c[2]=="Chau":
if c[3]=="verb-Non-independent":
no_start.append(i)
no_in.append(i)
if c[2]=="is there":
if c[3]=="verb-Independence":
no_term.append(i)
no_start.append(i)
no_in.append(i)
if c[2]=="Auxiliary verb":
if c[3]=="Special da":
no_term.append(i)
no_start.append(i)
no_in.append(i)
if c[2]=="Trout":
if c[3]=="Auxiliary verb":
no_term.append(i)
no_start.append(i)
no_in.append(i)
if "Continuous use" in c[5]:
no_term.append(i)
if c[5]=="Word connection":
no_start.append(i)
if c[2]=="Give me":
if c[3]=="verb-Non-independent":
no_start.append(i)
no_in.append(i)
x=""
y=""
z=""
koyu=""
if no_term:
x=no_term[0]
if no_start:
y=no_start[0]
if no_in:
z=no_in[0]
if koyu_meisi:
koyu=koyu_meisi[0]
#print("koyu",koyu)
koyu=int(koyu)
return x,y,z,koyu
small=[]
nodouble=[]
seq=""
def process(ty,tw,un,tagg):
global all
global seq
global small
global nodouble
tw=tw.replace("\n"," ")
sent_write.write(str(tw))
sent_write.write("\n")
parselist=m_owaka.parse(tw)
parsesplit=parselist.split()
parseocha=m_ocha.parse(tw)
l = [x.strip() for x in parseocha[0:len(parseocha)-5].split('\n')]
nodouble=[]
no_term=[]
no_start=[]
no_in=[]
km_l=[]
for i, block in enumerate(l):
c=block.split('\t')
#sent_write.write("\n")
#sent_write.write(str(c))
#sent_write.write("\n")
#print(str(c))
ha,hi,hu,km=eval_what_nosp(c,i)
no_term.append(ha)
no_start.append(hi)
no_in.append(hu)
km_l.append(km)
#Completed writing
if km_l[0]:
for r in km_l:
strin=parsesplit[r]
if not strin in nodouble:
all.append([strin,un])
nodouble.append(strin)
for s in range(2,8):
#Chains of 2 to 8 tokens.
#Important: widening this range trades processing speed for accuracy
num=g(len(parsesplit)-s+1,range(-1,s-1))
for nr in num:
#All 2- to 8-token chains for one sentence
#print(no_term)
if not len(set(nr) & set(no_in)):
if not nr[-1] in no_term:
if not nr[0] in no_start:
small=[]
#print(str(parsesplit))
for nr2 in nr:
#print(nr2,parsesplit[nr2])
#Add word to small at the position indexed by the array inside
small.append(parsesplit[nr2])
seq="".join(small)
judge_whole=0
bad_direct_word=["Like","\'mat","I\'mat"]
#if "" in seq:
# judge_whole=1
#if "" in seq:
# judge_whole=1
for bd in bad_direct_word:
if seq==bd:
judge_whole=1
break
#Re-parse the candidate string itself to inspect its part-of-speech sequence
parseocha_seq=m_ocha.parse(seq)
l = [x.strip() for x in parseocha_seq[0:len(parseocha_seq)-5].split('\n')]
for n in range(len(l)-1):
if len(l[n].split("\t"))==6:
if l[n].split("\t")[3]=="verb-Independence":
if len(l[n+1].split("\t"))==6:
if l[n+1].split("\t")[3]:
judge_whole=1
break
if judge_whole==0:
if validate(seq) and len(seq) > 3 and not re.findall(r'[「」、。『』/\\/@]', seq):
if not seq in nodouble:
#Continuous avoidance
all.append([seq,un])
nodouble.append(seq)
#print("Added successfully",seq)
#Do not aggregate the same word twice
else:
#print("Already included",seq)
pass
else:
#print("Exclusion",seq)
pass
else:
#print("The beginning is no_is start",seq)
pass
else:
#print("The end is no_term",seq)
pass
#print("\n")
#print(parsesplit)
#print(l)
if tagg:
print("tagg",tagg)
for sta in tagg:
all.append(["#"+str(sta),un])
#Include tag
N=1
#Number of tweets acquired
def print_varsize():
import types
print("{}{: >15}{}{: >10}{}".format('|','Variable Name','|',' Size','|'))
print(" -------------------------- ")
for k, v in globals().items():
if hasattr(v, 'size') and not k.startswith('_') and not isinstance(v,types.ModuleType):
print("{}{: >15}{}{: >10}{}".format('|',k,'|',str(v.size),'|'))
elif hasattr(v, '__len__') and not k.startswith('_') and not isinstance(v,types.ModuleType):
print("{}{: >15}{}{: >10}{}".format('|',k,'|',str(len(v)),'|'))
def collect_count():
global all
global deadline
hh=[]
tueall=[]
#print("alllll",all)
freshtime=int(time.time()*1000)-200000
deadline=-1
#import pdb; pdb.set_trace()
#print(N_time)
print(len(N_time))
for b in N_time:
if int(b[1]) < freshtime:
deadline=b[0]
print("dead",deadline)
dellist=[]
if not deadline ==-1:
for b in N_time:
print("b",b)
if int(b[0]) < int(deadline):
dellist.append(b)
for d in dellist:
N_time.remove(d)
#print(N_time)
#import pdb; pdb.set_trace()
#time.sleep(2)
#import pdb; pdb.set_trace()
for a in all:
if int(a[1]) > freshtime:
#Subtract (number of tweets you want to keep / 45) * 1000. Currently 5000/45*1000 = about 112000
tueall.append(a[0])
#print("tuealllappend"*10)
#print(tueall)
else:
all.remove(a)
#print("allremove",a)
#import pdb; pdb.set_trace()
c = collections.Counter(tueall)
c=c.most_common()
#print("c",c)
#print(c)
for r in c:
if r and r[1]>1:
hh.append([str(r[0]),str(r[1])])
k=str(hh).replace("[]","")
freq_write=open("custam_freq.txt","w",encoding="utf-8-sig", errors='ignore')
freq_write.write(str(k))
#import pdb; pdb.set_trace()
oldunix=N_time[0][1]
newunix=N_time[-1][1]
dato=str(datetime.datetime.fromtimestamp(oldunix/1000)).replace(":","-")
datn=str(datetime.datetime.fromtimestamp(newunix/1000)).replace(":","-")
dato=dato.replace(" ","_")
datn=datn.replace(" ","_")
#print(dato,datn)
#import pdb; pdb.set_trace()
freq_writea=open("trenddata/custam_freq-"+dato+"-"+datn+"--"+str(len(N_time))+".txt","w",encoding="utf-8-sig", errors='ignore')
freq_writea.write(str(k))
#import pdb; pdb.set_trace()
freq_write_tue=open("custam_freq_tue.txt","w",encoding="utf-8-sig", errors='ignore')
freq_write_tue.write(str(all))
#print(c)
def remove_emoji(src_str):
return ''.join(c for c in src_str if c not in emoji.UNICODE_EMOJI)
def deEmojify(inputString):
return inputString.encode('ascii', 'ignore').decode('ascii')
def get_tag(tw,text_content):
taglist=[]
entities=eval(str(tw.entities))["hashtags"]
for e in entities:
text=e["text"]
taglist.append(text)
for _ in range(len(taglist)+2):
for s in taglist:
text_content=re.sub(s,"",text_content)
#text_content=re.sub(r"#(.+?)+ ","",text_content)
return taglist,text_content
def get_time(id):
two_raw=format(int(id),'016b').zfill(64)
unixtime = int(two_raw[:-22],2) + 1288834974657
unixtime_th = datetime.datetime.fromtimestamp(unixtime/1000)
tim = str(unixtime_th).replace(" ","_")[:-3]
return tim,unixtime
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), '')
N_time=[]
def gather(tweet,type,tweet_type,removed_text):
global N
global N_all
global lagtime
global all_time
global all
global auth
global N_time
if get_time(tweet.id):
tim,unix=get_time(tweet.id)
else:
exit
#Get detailed tweet time
#original_text=tweet.text
nowtime=time.time()
tweet_pertime=str(round(N/(nowtime-all_time),1))
lag=str(round(nowtime-unix/1000,1))
#Calculate lag
lang=lang_dict[tweet.lang]
print(N_all,N,tweet_pertime,"/s","+"+lag,tim,type,tweet_type,lang)
#Information display.(All tweets, processed tweets, processing speed, lag, real time, acquisition route, tweet type, language)
print(removed_text.replace("\n"," "))
taglist,tag_removed_text=get_tag(tweet,removed_text)
#import pdb; pdb.set_trace()
#print(type(tweet))
#import pdb; pdb.set_trace()
#Exclude tags
noemoji=remove_emoji(tag_removed_text)
try:
process(tweet_type,tag_removed_text,unix,taglist)
N_time.append([N,unix])
print("trt",tag_removed_text)
except Exception as pe:
print("process error")
print(pe)
#import pdb; pdb.set_trace()
#Send to actual processing
surplus=N%1000
if surplus==0:
#sumprocess()
try:
collect_count()
except Exception as eeee:
print(eeee)
#exit
#Let's count
cft_read=open("custam_freq.txt","r",encoding="utf-8-sig")
cft_read=cft_read.read()
cft_read=eval(cft_read)
max_freq=cft_read[0][1]
#Maximum value
allen=inita(max_freq,cft_read)
#Make a list of trends with the same frequency.
finf=notwo(allen)
#Find and remove duplicate strings and trends
siage(finf)
#Write out as custam_freq_new
print_varsize()
#Display memory information
N=N+1
#streaming body
def judge_tweet_type(tweet):
text = re.sub("https?://[\w/:%#\$&\?\(\)~\.=\+\-]+","",tweet.text)
if tweet.in_reply_to_status_id_str :
text=re.sub(r"@[a-zA-Z0-9_]* ","",text)
text=re.sub(r"@[a-zA-Z0-9_]","",text)
return "reply",text
else:
head= str(tweet.text).split(":")
if len(head) >= 2 and "RT" in head[0]:
text=re.sub(r"RT @[a-zA-Z0-9_]*: ","",text)
return "retwe",text
else:
return "tweet",text
badword=["Question box","Let's throw marshmallows","I get a question","Participation in the war","Delivery","@","Follow","Application","Smartphone RPG","Gacha","S4live","campaign","Drift spirits","Present","Cooperative live","We are accepting consultations completely free of charge","Omikuji","Chance to win","GET","get","shindanmaker","Hit","lottery"]
N_all=0
def gather_pre(tweet,type):
global N_all
N_all=N_all+1
#Count all tweets passing through here
go=0
for b in badword:
if b in tweet.text:
go=1
break
#Check whether the text contains a bad word; go stays 0 (proceed) only if none is included
if go == 0:
if tweet.lang=="ja":
tweet_type,removed_text=judge_tweet_type(tweet)
#Determine tweet type
if tweet_type=="tweet":
try:
gather(tweet,type,tweet_type,removed_text)
#print(type(tweet))
#Send to gather processing.
except Exception as eee:
#gather("Ah","Ah","Ah","Ah")
#import pdb; pdb.set_trace()
pass
lagtime=0
def search(last_id):
#print(pat)
global pat
time_search =time.time()
for status in apiapp.search(q=pat,count="100",result_type="recent",since_id=last_id):
#Get newer tweets than the last tweet you got with search
gather_pre(status,"search")
#search body
interval = 2.16
#search call interval
#min2
trysearch=0
#search number of calls
class StreamingListener(tweepy.StreamListener):
def on_status(self, status):
global time_search
global trysearch
gather_pre(status,"stream")
time_stream=time.time()
time_stream-time_search % interval
if time_stream-time_search-interval>interval*trysearch:
#Run search once every interval seconds
last_id=status.id
#executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
#executor.submit(search(last_id))
#When trying parallel processing
search(last_id)
trysearch=trysearch+1
#streaming body
def carry():
listener = StreamingListener()
streaming = tweepy.Stream(auth, listener)
streaming.sample()
#stream call function
time_search =time.time()
#The time when the search was last executed, but defined before the stream
executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
#Parallel definition
all_time=time.time()
#Execution start time definition
try:
carry()
except Exception as e:
import pdb; pdb.set_trace()
print(e)
#import pdb; pdb.set_trace()
pass
#except Exception as ee:
#print(ee)
#import pdb; pdb.set_trace()
#carry body and error handling
Below is the program for meaning trends. I can recommend it with confidence, since it has been highly acclaimed (allegedly). The part that pulls from the thesaurus site is not code I wrote myself, but I will leave it in.
imi_trend.py
from bs4 import BeautifulSoup
import collections
import concurrent.futures
import datetime
import emoji
import itertools
import MeCab
from nltk import Tree
import os
from pathlib import Path
from pytz import timezone
import re
import spacy
import subprocess
import sys
import time
import tweepy
import unidic2ud
import unidic2ud.cabocha as CaboCha
from urllib.error import HTTPError, URLError
from urllib.parse import quote_plus
from urllib.request import urlopen
m=MeCab.Tagger("-d ./unidic-cwj-2.3.0")
os.remove("bunrui01.csv")
os.remove("all_tweet_text.txt")
os.remove("all_kakari_imi.txt")
bunrui01open=open("bunrui01.csv","a",encoding="utf-8")
textopen=open("all_tweet_text.txt","a",encoding="utf-8")
akiopen=open("all_kakari_imi.txt","a",encoding="utf-8")
catedic={}
with open('categori.txt') as f:
a=f.read()
aa=a.split("\n")
b=[]
bunrui01open.write(",,,")
for i, j in enumerate(aa):
catedic[j]=i
bunrui01open.write(str(j))
bunrui01open.write(",")
bunrui01open.write("\n")
print(catedic)
with open('./BunruiNo_LemmaID_ansi_user.csv') as f:
a=f.read()
aa=a.split(",\n")
b=[]
for bb in aa:
if len(bb.split(","))==2:
b.append(bb.split(","))
word_origin_num_to_cate=dict(b)
with open('./cate_rank2.csv') as f:
a=f.read()
aa=a.split("\n")
b=[]
for bb in aa:
if len(bb.split(","))==2:
b.append(bb.split(","))
cate_rank=dict(b)
class Synonym:
def getSy(self, word, target_url, css_selector):
try:
#Encoded because the URL to access contains Japanese
self.__url = target_url + quote_plus(word, encoding='utf-8')
#Access and parse
self.__html = urlopen(self.__url)
self.__soup = BeautifulSoup(self.__html, "lxml")
result = self.__soup.select_one(css_selector).text
return result
except HTTPError as e:
print(e.reason)
except URLError as e:
print(e.reason)
sy = Synonym()
alist = ["Selection"]
#Use "Japanese Thesaurus Associative Thesaurus" to search
target = "https://renso-ruigo.com/word/"
selector = "#content > div.word_t_field > div"
#for item in alist:
# print(sy.getSy(item, target, selector))
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
authapp = tweepy.AppAuthHandler(consumer_key,consumer_secret)
apiapp = tweepy.API(authapp)
#Authentication(Here api)
def remove_emoji(src_str):
return ''.join(c for c in src_str if c not in emoji.UNICODE_EMOJI)
def get_tag(tw,text_content):
taglist=[]
entities=eval(str(tw.entities))["hashtags"]
for e in entities:
text=e["text"]
taglist.append(text)
for _ in range(len(taglist)+2):
for s in taglist:
text_content=re.sub(s,"",text_content)
#text_content=re.sub(r"#(.+?)+ ","",text_content)
return taglist,text_content
def get_swap_dict(d):
return {v: k for k, v in d.items()}
def xcut(asub,a):
asub.append(a[0])
a=a[1:len(a)]
return asub,a
def ycut(asub,a):
asub.append(a[0])
a=a[1:len(a)]
return asub,a
def bunruikugiri(lastx,lasty):
hoge=[]
#import pdb; pdb.set_trace()
editx=[]
edity=[]
for _ in range(500):
edity,lasty=ycut(edity,lasty)
#target=sum(edity)
for _ in range(500):
target=sum(edity)
#rint("sum",sum(editx),"target",target)
if sum(editx)<target:
editx,lastx=xcut(editx,lastx)
elif sum(editx)>target:
edity,lasty=ycut(edity,lasty)
else:
hoge.append(editx)
editx=[]
edity=[]
if lastx==[] and lasty==[]:
return hoge
break
all_appear_cate=[]
all_unfound_word=[]
all_kumiawase=[]
nn=1
all_kakari_imi=[]
def process(tw,ty):
global nn
wordnum_toword={}
catenum_wordnum={}
word_origin_num=[]
mozisu=[]
try:
tw=re.sub("https?://[\w/:%#\$&\?\(\)~\.=\+\-]+","",tw)
tw=tw.replace("#","")
tw=tw.replace(",","")
tw=tw.replace("\u3000","") #Important for matching the number of characters
tw=re.sub(re.compile("[!-/:-@[-`{-~]"), '', tw)
parseocha=m.parse(tw)
print(tw)
l = [x.strip() for x in parseocha[0:len(parseocha)-5].split('\n')]
bunrui_miti_sentence=[]
for i, block in enumerate(l):
if len(block.split('\t')) > 1:
c=block.split('\t')
d=c[1].split(",")
#Word processing process
print(d,len(d))
if len(d)>9:
if d[10] in ["To do"]:
word_origin_num.append(d[10])
bunrui_miti_sentence.append(d[8])
mozisu.append(len(d[8]))
elif d[-1] in word_origin_num_to_cate:
word_origin_num.append(int(d[-1]))
wordnum_toword[int(d[-1])]=d[8]
bunrui_miti_sentence.append(word_origin_num_to_cate[str(d[-1])])
mozisu.append(len(d[8]))
else:
#print("nai",d[8])
#Display of unknown words
all_unfound_word.append(d[10])
bunrui_miti_sentence.append(d[8])
mozisu.append(len(c[0]))
else:
mozisu.append(len(c[0]))
all_unfound_word.append(c[0])
bunrui_miti_sentence.append(c[0])
#else:
# mozisu.append(l[])
#print("kouho",word_origin_num,"\n")
#Words to original numbers
#print(tw)
#If you look at sentences made with semantic classification and unknown words
for s in bunrui_miti_sentence:
print(s," ",end="")
print("\n")
stn=0
cmd = "echo "+str(tw)+" | cabocha -f1"
cmdtree="echo "+str(tw)+" | cabocha "
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,shell=True)
proctree = subprocess.Popen(cmdtree, stdout=subprocess.PIPE, stderr=subprocess.PIPE,shell=True)
proc=proc.communicate()[0].decode('cp932')
proctree=proctree.communicate()[0].decode('cp932')
print(proctree)
proclist=proc.split("\n")
#print(proc)
#f1 information
#print(proclist)
#Listing information
procnumlist=[]
wordlis=[]
eachword=""
num=0
for p in proclist:
if p[0]=="*":
f=p.split(" ")[1]
t=p.split(" ")[2].replace("D","")
procnumlist.append([f,t])
if eachword:
wordlis.append([num,eachword])
num=num+1
eachword=""
elif p=="EOS\r":
wordlis.append([num,eachword])
num=num+1
eachword=""
break
else:
#print("aaaaa",p.split("\t")[0])
eachword=eachword+p.split("\t")[0]
tunagari_num_dict=dict(procnumlist)
print(tunagari_num_dict)
bunsetu_num_word=dict(wordlis)
#print(bunsetu_num_word)
bunsetu_mozisu=[]
for v in bunsetu_num_word.values():
bunsetu_mozisu.append(len(v))
if sum(bunsetu_mozisu) != sum(mozisu):
return
#print("mozisu",mozisu)
#print("bunsetumozi",bunsetu_mozisu)
res=bunruikugiri(mozisu,bunsetu_mozisu)
#print("res",res)
nnn=0
small_cateandcharlist=[]
big_cateandcharlist=[]
for gc in res:
for _ in range(len(gc)):
print(bunrui_miti_sentence[nnn],end=" ")
if bunrui_miti_sentence[nnn] in list(catedic.keys()):
small_cateandcharlist.append(bunrui_miti_sentence[nnn])
nnn=nnn+1
#Unknown words and particles are considered to be the same, so the mecabne gold dictionary can be used.
if small_cateandcharlist==[]:
big_cateandcharlist.append(["null"])
else:
big_cateandcharlist.append(small_cateandcharlist)
small_cateandcharlist=[]
print("\n")
#print("bcacl",big_cateandcharlist)
twewtnai_kakari_imi=[]
if len(big_cateandcharlist)>1 and len(big_cateandcharlist)==len(bunsetu_num_word):
#Dependencies and morphological analysis delimiters do not match
for kk, vv in tunagari_num_dict.items():
if vv != "-1":
for aaw in big_cateandcharlist[int(kk)]:
for bbw in big_cateandcharlist[int(vv)]:
twewtnai_kakari_imi.append([aaw,bbw])
if not "Rank symbol" in str([aaw,bbw]):
if not "null" in str([aaw,bbw]):
if not "Number sign" in str([aaw,bbw]):
if not "Things" in str([aaw,bbw]):
all_kakari_imi.append(str([aaw,bbw]))
akiopen.write(str([aaw,bbw]))
else:
break
else:
return
akiopen.write("\n")
akiopen.write(str(bunrui_miti_sentence))
akiopen.write("\n")
akiopen.write(str(tw))
akiopen.write("\n")
print("tki",twewtnai_kakari_imi)
tweetnai_cate=[]
word_cate_num=[]
for k in word_origin_num:
if str(k) in word_origin_num_to_cate:
ram=word_origin_num_to_cate[str(k)]
print(ram,cate_rank[ram],end="")
tweetnai_cate.append(ram)
all_appear_cate.append(ram)
word_cate_num.append(catedic[ram])
catenum_wordnum[catedic[ram]]=int(k)
stn=stn+1
else:
if k in ["To do"]:
all_appear_cate.append(k)
tweetnai_cate.append(k)
print("\n")
#print(tweetnai_cate)
#import pdb; pdb.set_trace()
for k in tweetnai_cate:
if k in catedic:
aac=catedic[k]
#print("gyaku",word_cate_num)
#print("wt",wordnum_toword)
#print("cw",catenum_wordnum)
bunrui01open.write(str(tw))
bunrui01open.write(",")
bunrui01open.write(str(tim))
bunrui01open.write(",")
bunrui01open.write(str(unix))
bunrui01open.write(",")
ps=0
for tt in list(range(544)):
if int(tt) in word_cate_num:
a=catenum_wordnum[tt]
#Word number from the category number
bunrui01open.write(str(wordnum_toword[a]))
#Word from word number
bunrui01open.write(",")
ps=ps+1
else:
bunrui01open.write("0,")
bunrui01open.write("end")
bunrui01open.write("\n")
textopen.write(str(nn))
textopen.write(" ")
textopen.write(tw)
textopen.write("\n")
nn=nn+1
#Add every pairwise combination of categories
for k in list(itertools.combinations(tweetnai_cate,2)):
all_kumiawase.append(k)
except Exception as ee:
print(ee)
import pdb; pdb.set_trace()
pass
def judge_tweet_type(tweet):
if tweet.in_reply_to_status_id_str:
return "reply"
else:
head= str(tweet.text).split(":")
if len(head) >= 2 and "RT" in head[0]:
return "retwe"
else:
return "tweet"
#Judge whether it is a reply, a retweet, or a plain tweet
def get_time(id):
two_raw=format(int(id),'016b').zfill(64)
unixtime = int(two_raw[:-22],2) + 1288834974657
unixtime_th = datetime.datetime.fromtimestamp(unixtime/1000)
tim = str(unixtime_th).replace(" ","_")[:-3]
return tim,unixtime
#Tweet time from id
N=1
def gather(tweet,type,tweet_typea):
global all_appear_cate
global N
global all_time
global tim
global unix
tim,unix=get_time(tweet.id)
original_text=tweet.text.replace("\n","")
taglist,original_text=get_tag(tweet,original_text)
nowtime=time.time()
tweet_pertime=str(round(N/(nowtime-all_time),1))
lag=str(round(nowtime-unix/1000,1))
#lang=lang_dict[tweet.lang]
try:
process(remove_emoji(original_text),tweet_typea,)
except Exception as e:
print(e)
#import pdb; pdb.set_trace()
pass
print(N,tweet_pertime,"/s","+"+lag,tim,type,tweet_typea)
N=N+1
if N%500==0:
ccdd=collections.Counter(all_appear_cate).most_common()
for a in ccdd:
print(a)
#ccdd=collections.Counter(all_unfound_word).most_common()
#for a in ccdd:
# print("Absent",a)
ccdd=collections.Counter(all_kumiawase).most_common(300)
for a in ccdd:
print(a)
ccdd=collections.Counter(all_kakari_imi).most_common(300)
for a in ccdd:
print("all_kakari_imi",a)
#import pdb; pdb.set_trace()
#All tweets of stream and search are collected
def pre_gather(tw,ty):
#print(ty)
# if "http://utabami.com/TodaysTwitterLife" in tw.text:
print(tw.text)
if ty=="stream":
tweet_type=judge_tweet_type(tw)
if tw.lang=="ja" and tweet_type=="tweet":
gather(tw,ty,tweet_type)
elif ty=="search":
gather(tw,ty,"tweet")
def search(last_id):
time_search =time.time()
for status in apiapp.search(q="filter:safe OR -filter:safe -filter:retweets -filter:replies lang:ja",count="100",result_type="recent",since_id=last_id):
pre_gather(status,"search")
#search body
class StreamingListener(tweepy.StreamListener):
def on_status(self, status):
global time_search
global trysearch
pre_gather(status,"stream")
time_stream=time.time()
time_stream-time_search % interval
if time_stream-time_search-interval>interval*trysearch:
last_id=status.id
#executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
#executor.submit(search(last_id))
search(last_id)
trysearch=trysearch+1
#streaming body
def carry():
listener = StreamingListener()
streaming = tweepy.Stream(auth, listener)
streaming.sample()
interval = 2.1
trysearch=0
time_search =time.time()
#executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
all_time=time.time()
try:
#executor.submit(carry)
carry()
except Exception as er:
print(er)
import pdb; pdb.set_trace()
pass
Every 500 tweets, it prints the number of occurrences of individual semantic categories, of 2-gram category combinations, and of dependency-linked category pairs. It also writes out, for every tweet, the tweet text, the UniDic analysis information, the CaboCha dependencies, the replacement with semantic categories, and so on.
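As a distilled sketch of that counting step (the category labels here are made up; in the script the sequences come from the Word List by Semantic Principles lookups inside process()):

```python
import collections
import itertools

# Hypothetical per-tweet sequences of semantic categories.
tweets_categories = [
    ["transport", "stoppage", "time"],
    ["transport", "stoppage"],
]

singles, pairs = [], []
for cats in tweets_categories:
    singles.extend(cats)                           # single-category occurrences
    pairs.extend(itertools.combinations(cats, 2))  # 2-gram category combinations

print(collections.Counter(singles).most_common())
print(collections.Counter(pairs).most_common(300))
```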
bunrui01.csv - a CSV whose columns are the 544 semantic categories and whose rows are tweets; a cell holds 0 if the category does not appear and the corresponding word if it does.
all_tweet_text.txt - the processed tweets together with their sequence numbers.
all_kakari_imi.txt - the dependency-linked meaning pairs, the text with words replaced by semantic categories, and the original text.
categori.txt - a text file listing the 544 semantic categories; a category-to-index dictionary is built from it at runtime.
BunruiNo_LemmaID_ansi_user.csv - see https://pj.ninjal.ac.jp/corpus_center/goihyo.html for details; as described there, it is a correspondence table between word origin numbers (lemma IDs) and semantic categories.
cate_rank2.csv - a dictionary of the semantic categories ranked by how often they appeared, created at an earlier point.
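For reference, a sketch of loading those correspondence tables with the csv module (same file names and two-column layout as assumed at the top of imi_trend.py; UTF-8 encoding is assumed here, and this mirrors rather than replaces the loading code in the script):

```python
import csv

# Lemma ID -> semantic category, from the NINJAL correspondence table.
word_origin_num_to_cate = {}
with open("BunruiNo_LemmaID_ansi_user.csv", encoding="utf-8") as f:
    for row in csv.reader(f):
        if len(row) >= 2 and row[0]:
            word_origin_num_to_cate[row[0]] = row[1]

# Semantic category -> column index (0..543), from the category list.
with open("categori.txt", encoding="utf-8") as f:
    catedic = {name: i for i, name in enumerate(f.read().splitlines())}
```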
I will explain the other variables later. This is largely a memo for myself, and I trust that readers who want to understand it will work through it, so I will leave it at this.