regex.spl
| makeresults
| eval text="THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."
| rex field=text "(?ix)((?P<big_japan>(?P<japan>Japan).*?(?P=japan))) #From Japan to japan"
I was able to use named groups and backreferences in Splunk as well. I post about that elsewhere, but I haven't done much with Python's re module, so I practiced here.
The official re documentation is very easy to understand.
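For comparison, here is a minimal sketch of the same idea in Python's re module. It assumes that the inline flags (?ix) and the named backreference (?P=japan) behave as in the rex command above, and it only uses the first sentence of the sample text.

```python
import re

text = "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."

# (?ix) turns on IGNORECASE and VERBOSE inline, as in the rex command.
# (?P=japan) refers back to whatever the group named "japan" matched.
pattern = re.compile(r"""(?ix)
    (?P<big_japan>
        (?P<japan>Japan) .*? (?P=japan)
    )   # from the first Japan to the next one
""")

m = pattern.search(text)
print(m.group("japan"))      # JAPAN
print(m.group("big_japan"))  # JAPANESE SCHOOL EDUCATION In Japan
```

Because of the i flag, the backreference also matches case-insensitively, so the JAPAN in JAPANESE pairs up with the later Japan.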
sample.txt
"THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now. The content of education is reduced and students come to have free time more. Furthermore, 'total education time' is taken in all Japanese junior high school. I think this change is bad and Japanese government must change it to original form rapidly for the following reasons. Firstly, many young people this time cannot read or write basic words (Japanese 'kanji.') And, they cannot calculate, too. These things are need in daily life, even if they don't go to college or university. Originally, Japanese student got better score in reading and calculation than any other country's student few decades ago. For, reading, writing, and calculation were very important in Japanese society. Now, however, this good value in old Japan is being reduced. This is very large problem in Japan. Secondly, there is deep gap between the level of high school education and university education. Many students who don't learn the content of high school education cannot catch up with the class in universities. Furthermore, for example, I am medical student, but I don't learn biology in high school. And there are many students like me. In addition, the care of university to us is nearly nothing. So, the level of the study in technology, medicine and so is going down. This is very large problem in Japan, too. Thirdly, as the content of school education is reduced, at the same time, the curiosity of students seems reduced. The new idea and new device are coming from the curiosity, I think. So, the reduction of it means the down of possibility that the evolutional change in various field will happen. This is very large problem in Japan. In conclusion, there are problems like these in Japan, because of the reduction of basic education. Luckily, the Japanese government is planning to change the education system. I hope this change will be going back to old Japanese school education system. \n"
Quoted from https://www.f.waseda.jp/yusukekondo/TALL19/TALL_Spring03.html
search
Use search, because match only matches at the beginning of the string (as if the pattern started with ^).
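The difference can be seen with a small sketch. The pattern here is a plain literal chosen only for illustration, and the text is the first sentence of sample.txt.

```python
import re

text = "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."

pattern = re.compile(r"Japan")

# match only tries position 0, so it fails: the text starts with "THE".
print(pattern.match(text))           # None

# search scans forward and returns the first occurrence.
found = pattern.search(text)
print(found.span(), found.group())   # (33, 38) Japan
```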
search.py
import re
# text is assumed to hold the contents of sample.txt above
m=re.compile(r"""
\b(?P<sentence>.*?[.]) #Try to extract with sentences
""",re.X)
result=m.search(text)
print(result)
Since the text is English, I try splitting it into sentences at the period (.).
<_sre.SRE_Match object; span=(0, 78), match='THE JAPANESE SCHOOL EDUCATION In Japan, education>
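The first sentence, 'THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now.', is 78 characters long (79 if the opening quotation mark in sample.txt is counted), so span=(0, 78) covers exactly that sentence.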
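The span can be checked directly against the string. This is a small sketch using only the first two sentences of the sample, written without re.X for brevity.

```python
import re

text = ("THE JAPANESE SCHOOL EDUCATION In Japan, education system is "
        "changing fast now. The content of education is reduced and "
        "students come to have free time more.")

first = re.search(r"\b(?P<sentence>.*?[.])", text)

start, end = first.span()
print(start, end)                                  # 0 78
print(len(first.group("sentence")))                # 78
print(text[start:end] == first.group("sentence"))  # True
```

The end index of a span is exclusive, so text[start:end] is the matched sentence including its period.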
Why can't I find a description of this SRE_Match object in the Python 3 documentation?
SRE_Match object
getattr.py
import re
m=re.compile(r"""
\b(?P<sentence>.*?[.]) #Try to extract with sentences
""",re.X)
result=m.search(text)
for i in dir(result):
    if not i.startswith('__'):
        print(f'{i}: {getattr(result,i)}')
I'm not sure what a Match object is, so let's check its attributes and methods.
result
end: <built-in method end of _sre.SRE_Match object at 0x7fe65d3ba198>
endpos: 1969
expand: <built-in method expand of _sre.SRE_Match object at 0x7fe65d3ba198>
group: <built-in method group of _sre.SRE_Match object at 0x7fe65d3ba198>
groupdict: <built-in method groupdict of _sre.SRE_Match object at 0x7fe65d3ba198>
groups: <built-in method groups of _sre.SRE_Match object at 0x7fe65d3ba198>
lastgroup: sentence
lastindex: 1
pos: 0
re: re.compile('\n\\b(?P<sentence>.*?[.]) #Try to extract with sentences\n', re.VERBOSE)
regs: ((0, 78), (0, 78))
span: <built-in method span of _sre.SRE_Match object at 0x7fe65d3ba198>
start: <built-in method start of _sre.SRE_Match object at 0x7fe65d3ba198>
string: THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now. The (...abridged)
This is the Match Object described in the Python 2.7 documentation. Where is the Python 3 one? (It seems to be the "Match Objects" section of the re documentation.)
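The listing above prints bound methods rather than their values. A possible variation, sketched here on the first sentence only, is to separate methods from plain attributes with callable() and then call the methods that take no arguments.

```python
import re

text = "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."
result = re.search(r"\b(?P<sentence>.*?[.])", text)

# Label each public name as a method or a plain attribute.
for name in dir(result):
    if name.startswith("__"):
        continue
    value = getattr(result, name)
    kind = "method" if callable(value) else "attribute"
    print(f"{name}: {kind}")

# Calling the methods returns the actual values.
print(result.start(), result.end())   # 0 78
print(result.groups())                # 1-tuple holding the sentence
print(result.groupdict())             # dict with the key 'sentence'
```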
findall
findall.py
import re
m=re.compile(r"""
\b(?P<sentence>.*?[.]) #Try to extract with sentences
""",re.X)
result=m.findall(text) # search returns only the first match; findall returns all of them
print(type(result))
print('-'*10)
for i in dir(result):
    if not i.startswith('__'):
        print(f'{i}: {getattr(result,i)}')
print('-'*10)
for i in result:
    print(i) # the result is a list, so print the items one by one
If you want all the matches, use findall.
result
<class 'list'>
----------
append: <built-in method append of list object at 0x7fe65d2dca48>
clear: <built-in method clear of list object at 0x7fe65d2dca48>
copy: <built-in method copy of list object at 0x7fe65d2dca48>
count: <built-in method count of list object at 0x7fe65d2dca48>
extend: <built-in method extend of list object at 0x7fe65d2dca48>
index: <built-in method index of list object at 0x7fe65d2dca48>
insert: <built-in method insert of list object at 0x7fe65d2dca48>
pop: <built-in method pop of list object at 0x7fe65d2dca48>
remove: <built-in method remove of list object at 0x7fe65d2dca48>
reverse: <built-in method reverse of list object at 0x7fe65d2dca48>
sort: <built-in method sort of list object at 0x7fe65d2dca48>
----------
THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now.
The content of education is reduced and students come to have free time more.
Furthermore, 'total education time' is taken in all Japanese junior high school.
I think this change is bad and Japanese government must change it to original form rapidly for the following reasons.
Firstly, many young people this time cannot read or write basic words (Japanese 'kanji.
And, they cannot calculate, too.
These things are need in daily life, even if they don't go to college or university.
Originally, Japanese student got better score in reading and calculation than any other country's student few decades ago.
For, reading, writing, and calculation were very important in Japanese society.
Now, however, this good value in old Japan is being reduced.
This is very large problem in Japan.
Secondly, there is deep gap between the level of high school education and university education.
Many students who don't learn the content of high school education cannot catch up with the class in universities.
Furthermore, for example, I am medical student, but I don't learn biology in high school.
And there are many students like me.
In addition, the care of university to us is nearly nothing.
So, the level of the study in technology, medicine and so is going down.
This is very large problem in Japan, too.
Thirdly, as the content of school education is reduced, at the same time, the curiosity of students seems reduced.
The new idea and new device are coming from the curiosity, I think.
So, the reduction of it means the down of possibility that the evolutional change in various field will happen.
This is very large problem in Japan.
In conclusion, there are problems like these in Japan, because of the reduction of basic education.
Luckily, the Japanese government is planning to change the education system.
I hope this change will be going back to old Japanese school education system.
The result is a list.
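What the list contains depends on how many groups the pattern has. The sketch below uses a short made-up sentence and unnamed groups, purely for illustration.

```python
import re

text = "In Japan, education is changing. Japanese students have free time."

# No group: each item is the whole match.
print(re.findall(r"japan\w*", text, re.I))       # ['Japan', 'Japanese']

# One group: each item is that group's text, not the whole match.
print(re.findall(r"(japan)\w*", text, re.I))     # ['Japan', 'Japan']

# Two groups: each item is a tuple with one entry per group.
print(re.findall(r"(japan)(\w*)", text, re.I))   # [('Japan', ''), ('Japan', 'ese')]
```

In findall.py the pattern has exactly one group (sentence), which is why the list holds plain strings.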
split
split.py
import re
result1=re.split(r'(?<=\.)\s',text) # split on the whitespace after each period so the period stays in the sentence
print(type(result1))
print('-'*10)
m2=re.compile(r"""
(?P<japan_txt>japan.*?)\b #test
""",re.VERBOSE|re.IGNORECASE)
# map each sentence index to [sentence, first match] for the sentences that contain japan
{i:[v,re.search(m2,v).group()] for i,v in enumerate(result1) if re.search(m2,v)}
I thought split() would be enough to separate the sentences.
I wanted to keep the delimiter (.) in each sentence, so I split on the whitespace that follows it, using a lookbehind.
result
<class 'list'>
----------
{0: ['THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now.',
'JAPANESE'],
2: ["Furthermore, 'total education time' is taken in all Japanese junior high school.",
'Japanese'],
3: ['I think this change is bad and Japanese government must change it to original form rapidly for the following reasons.',
'Japanese'],
4: ["Firstly, many young people this time cannot read or write basic words (Japanese 'kanji.') And, they cannot calculate, too.",
'Japanese'],
6: ["Originally, Japanese student got better score in reading and calculation than any other country's student few decades ago.",
'Japanese'],
7: ['For, reading, writing, and calculation were very important in Japanese society.',
'Japanese'],
8: ['Now, however, this good value in old Japan is being reduced.', 'Japan'],
9: ['This is very large problem in Japan.', 'Japan'],
16: ['This is very large problem in Japan, too.', 'Japan'],
20: ['This is very large problem in Japan.', 'Japan'],
21: ['In conclusion, there are problems like these in Japan, because of the reduction of basic education.',
'Japan'],
22: ['Luckily, the Japanese government is planning to change the education system.',
'Japanese'],
23: ['I hope this change will be going back to old Japanese school education system.',
'Japanese']}
The result of split is a list.
After that, I search for japan regardless of case (re.IGNORECASE) and output the sentences that contain it as a dictionary of the form index: [matching sentence, matched string].
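To see why the lookbehind keeps the period, it may help to compare three ways of splitting. The three-word text is a made-up example.

```python
import re

text = "One. Two. Three."

# Splitting on the period itself throws the delimiter away.
print(re.split(r"\.\s", text))        # ['One', 'Two', 'Three.']

# A lookbehind only asserts that a period precedes the space,
# so the period is not consumed and stays in each sentence.
print(re.split(r"(?<=\.)\s", text))   # ['One.', 'Two.', 'Three.']

# A capturing group keeps the delimiter as separate list items.
print(re.split(r"(\.\s)", text))      # ['One', '. ', 'Two', '. ', 'Three.']
```

Note that a period followed by something other than whitespace is not a split point, which is why the 'kanji.' sentence stays joined to the next one in the result above.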
finditer
m2=re.compile(r"""
(?P<japan_txt>japan.*?)\b #test
""",re.VERBOSE|re.IGNORECASE)
result=re.finditer(m2,text)
print(result)
print('-'*10)
for i in result:
    print(i)
finditer returns the results as an iterator (https://docs.python.org/ja/3/library/stdtypes.html#typeiter).
result
<callable_iterator object at 0x7fe65d2e5ba8>
----------
<_sre.SRE_Match object; span=(4, 12), match='JAPANESE'>
<_sre.SRE_Match object; span=(33, 38), match='Japan'>
<_sre.SRE_Match object; span=(209, 217), match='Japanese'>
<_sre.SRE_Match object; span=(269, 277), match='Japanese'>
<_sre.SRE_Match object; span=(427, 435), match='Japanese'>
<_sre.SRE_Match object; span=(576, 584), match='Japanese'>
<_sre.SRE_Match object; span=(749, 757), match='Japanese'>
<_sre.SRE_Match object; span=(804, 809), match='Japan'>
<_sre.SRE_Match object; span=(858, 863), match='Japan'>
<_sre.SRE_Match object; span=(1368, 1373), match='Japan'>
<_sre.SRE_Match object; span=(1705, 1710), match='Japan'>
<_sre.SRE_Match object; span=(1760, 1765), match='Japan'>
<_sre.SRE_Match object; span=(1825, 1833), match='Japanese'>
<_sre.SRE_Match object; span=(1934, 1942), match='Japanese'>
Each Match object gives the position (span) and the matched text.
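Instead of printing the Match objects, the positions and text can be pulled out with start(), end() and group(). This sketch runs on the first sentence only and writes the same pattern without re.VERBOSE.

```python
import re

text = "THE JAPANESE SCHOOL EDUCATION In Japan, education system is changing fast now."
pattern = re.compile(r"(?P<japan_txt>japan.*?)\b", re.IGNORECASE)

hits = [(m.start(), m.end(), m.group("japan_txt")) for m in pattern.finditer(text)]
print(hits)   # [(4, 12, 'JAPANESE'), (33, 38, 'Japan')]

# An iterator can only be consumed once.
it = pattern.finditer(text)
print(len(list(it)))   # 2
print(len(list(it)))   # 0
```

If the matches are needed more than once, it is probably simplest to store them in a list first.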
groupdict
groupdict.py
m2=re.compile(r"""
(?P<japan_txt>japan.*?)\b #test
""",re.VERBOSE|re.IGNORECASE)
result=re.finditer(m2,text)
[i.groupdict() for i in result]
result
[{'japan_txt': 'JAPANESE'},
{'japan_txt': 'Japan'},
{'japan_txt': 'Japanese'},
{'japan_txt': 'Japanese'},
{'japan_txt': 'Japanese'},
{'japan_txt': 'Japanese'},
{'japan_txt': 'Japanese'},
{'japan_txt': 'Japan'},
{'japan_txt': 'Japan'},
{'japan_txt': 'Japan'},
{'japan_txt': 'Japan'},
{'japan_txt': 'Japan'},
{'japan_txt': 'Japanese'},
{'japan_txt': 'Japanese'}]
For each match, a dictionary mapping the group name to the captured string is returned.
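groupdict becomes more useful with more than one named group. The following sketch uses two hypothetical group names, stem and suffix, and a short made-up text, then counts the matched forms with collections.Counter.

```python
import re
from collections import Counter

text = "In Japan, Japanese students study. Old Japan valued reading."
pattern = re.compile(r"(?P<stem>japan)(?P<suffix>\w*)", re.IGNORECASE)

rows = [m.groupdict() for m in pattern.finditer(text)]
print(rows)
# [{'stem': 'Japan', 'suffix': ''}, {'stem': 'Japan', 'suffix': 'ese'}, {'stem': 'Japan', 'suffix': ''}]

# Rebuild each matched word from its parts and count the forms.
counts = Counter(row["stem"] + row["suffix"] for row in rows)
print(counts)   # Counter({'Japan': 2, 'Japanese': 1})
```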
I tried various things, but there is still more to cover. I'll stop here for now.