CRuby code reading (3): rb_bug execution line output

This time's theme

This is a series of articles in which I read the CRuby implementation. This is the third installment.

I was busy with work today (I got home at 23:30 ...) and this post was delayed. Last time was relatively heavy, so this time I will pick a lighter topic.

So, let's read rb_bug, which is used to emit the error logs we often see. The goal is to answer the question "How is the line number in the source file obtained when an error is output?"

`rb_bug`

rb_bug is defined in error.c.

void
rb_bug(const char *fmt, ...)
{
    const char *file = NULL;
    int line = 0;

    if (GET_THREAD()) {
	file = rb_sourcefile();
	line = rb_sourceline();
    }

    report_bug(file, line, fmt, NULL);

    die();
}

The code is very straightforward. I read GET_THREAD() last time; it gets the current thread object.

Get the current thread object
Get the source file being executed and its line number
Output them with the report_bug function
Forcibly terminate the interpreter (die())

The function does exactly what the code says.

There are two main questions.

How do rb_sourcefile() and rb_sourceline() get the line being executed?
    Why isn't the thread object obtained with GET_THREAD() passed to them as an argument?
How is report_bug different from a plain fprintf(stderr, ...)?

The latter has nothing to do with today's goal, so I will put it on hold and read it another time.

Let's read the former.

`rb_sourcefile()` and `rb_sourceline()`

The definitions of rb_sourcefile() and rb_sourceline() are in vm.c.

const char *
rb_sourcefile(void)
{
    rb_thread_t *th = GET_THREAD();
    rb_control_frame_t *cfp = rb_vm_get_ruby_level_next_cfp(th, th->cfp);

    if (cfp) {
	return RSTRING_PTR(cfp->iseq->location.path);
    }
    else {
	return 0;
    }
}

int
rb_sourceline(void)
{
    rb_thread_t *th = GET_THREAD();
    rb_control_frame_t *cfp = rb_vm_get_ruby_level_next_cfp(th, th->cfp);

    if (cfp) {
	return rb_vm_get_sourceline(cfp);
    }
    else {
	return 0;
    }
}

The first thing that stands out is that both call GET_THREAD() at the beginning. So the reason rb_bug does not pass the result of GET_THREAD() is that the thread is reacquired inside each function. Perhaps the judgment was that calling rb_sourcefile() for anything other than the current thread makes no sense?

Both functions then call rb_vm_get_ruby_level_next_cfp() to get a pointer to an rb_control_frame_t as cfp.

If cfp exists, the actual file name and line number are obtained from this struct. From the name rb_control_frame_t, we can guess that cfp is a structure holding execution context information. How this context information is built cannot be understood without reading the entire VM. For now, I will simply assume that the execution context contains enough information to work out the file and line being executed, and skip it without reading further.

In fact, rb_sourcefile() reads iseq->location.path from the structure, which is a Ruby string, and returns a pointer to its character data via RSTRING_PTR. So iseq->location.path holds the path of the running file inside the VM, and by reading it the error message can print the proper file path.

So what remains is rb_vm_get_sourceline(), which rb_sourceline() uses to extract the line information from cfp ...!

`rb_vm_get_sourceline()`

This definition exists in vm_backtrace.c.

inline static int
calc_lineno(const rb_iseq_t *iseq, const VALUE *pc)
{
    return rb_iseq_line_no(iseq, pc - iseq->iseq_encoded);
}

int
rb_vm_get_sourceline(const rb_control_frame_t *cfp)
{
    int lineno = 0;
    const rb_iseq_t *iseq = cfp->iseq;

    if (RUBY_VM_NORMAL_ISEQ_P(iseq)) {
	lineno = calc_lineno(cfp->iseq, cfp->pc);
    }
    return lineno;
}

RUBY_VM_NORMAL_ISEQ_P is defined in vm_core.h. I skipped it; it presumably checks for cases in which the line cannot be obtained.

calc_lineno() subtracts iseq->iseq_encoded from the cfp->pc pointer.

First of all, iseq in the VM source probably means "instruction sequence". Looking at how iseq_encoded is used elsewhere, I also found macros like the following.

#define REG_PC  (REG_CFP->pc)
#define GET_PC_COUNT()     (REG_PC - GET_ISEQ()->iseq_encoded)

Subtracting iseq->iseq_encoded from the pc pointer gives PC_COUNT. So pc is the actual execution position in memory, and iseq->iseq_encoded is presumably the pointer to the start of the array that holds the instructions.

The call chain goes deeper still: this function in turn delegates to rb_iseq_line_no(), which is defined in iseq.c.
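The pointer subtraction can be illustrated with a small sketch. This is not CRuby code: toy_value_t and toy_pc_offset are hypothetical names invented here, standing in for VALUE and for the pc - iseq_encoded expression.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VM word; CRuby uses VALUE here. */
typedef unsigned long toy_value_t;

/* Returns how many words pc lies past the start of the encoded
 * instruction array. Both pointers must point into the same array
 * (or one past its end), otherwise the subtraction is undefined. */
static size_t
toy_pc_offset(const toy_value_t *encoded, const toy_value_t *pc)
{
    assert(encoded != NULL && pc >= encoded);
    return (size_t)(pc - encoded);
}
```

In C, subtracting two pointers into the same array yields a count of elements, not bytes, which is why the result can be used directly as a position in the instruction sequence.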

unsigned int
rb_iseq_line_no(const rb_iseq_t *iseq, size_t pos)
{
    if (pos == 0) {
	return find_line_no(iseq, pos);
    }
    else {
	return find_line_no(iseq, pos - 1);
    }
}

If pos is 0 it is used as is; otherwise find_line_no() is called for the position one before it. This is probably meant to find the line where the error actually occurred by moving the execution position back one step.
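One thing the pos == 0 branch certainly does is protect the subtraction: pos is a size_t, so 0 - 1 would wrap around to a huge value instead of becoming negative. A minimal sketch of just that adjustment follows; toy_lookup_pos and its iseq_size parameter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Maps the current position to the position used for the line lookup:
 * step back by one, except at position 0 where stepping back would
 * wrap around because size_t is unsigned. */
static size_t
toy_lookup_pos(size_t pos, size_t iseq_size)
{
    assert(pos <= iseq_size);   /* pos may be one past the last word */
    return pos == 0 ? 0 : pos - 1;
}
```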

Then find_line_no () is as follows.

static unsigned int
find_line_no(const rb_iseq_t *iseq, size_t pos)
{
    struct iseq_line_info_entry *entry = get_line_info(iseq, pos);
    if (entry) {
	return entry->line_no;
    }
    else {
	return 0;
    }
}

If an iseq_line_info_entry is found, the line_no written in it is the line being executed. The definition of iseq_line_info_entry is as follows.

struct iseq_line_info_entry {
    unsigned int position;
    unsigned int line_no;
};

It is a simple pair of position and line_no. Presumably the structure expresses which source line a given instruction position corresponds to.

Then get_line_info () is ...

static struct iseq_line_info_entry *
get_line_info(const rb_iseq_t *iseq, size_t pos)
{
<Omission>
  :
    return &table[i-1];
}

I have finally arrived. This seems to be where the real work happens. Reading from the top ...

<Omission>
  :
    size_t i = 0, size = iseq->line_info_size;
    struct iseq_line_info_entry *table = iseq->line_info_table;
  :
<Omission>

First, line_info_table is taken from iseq into table. Since table is a pointer to iseq_line_info_entry, it is an array that lists the correspondence between instructions and the line numbers of those instructions in the original source file.

Continue reading get_line_info ().

<Omission>
  :
    const int debug = 0;

    if (debug) {
	printf("size: %"PRIdSIZE"\n", size);
	printf("table[%"PRIdSIZE"]: position: %d, line: %d, pos: %"PRIdSIZE"\n",
	       i, table[i].position, table[i].line_no, pos);
    }
  :
<Omission>

This is a debugging aid; it prints the table size and the first entry. I will skip it.

<Omission>
  :
    if (size == 0) {
	return 0;
    }
    else if (size == 1) {
	return &table[0];
    }
    else <The following is omitted>

There is a big if statement; these are its first two conditions. If the size of table is 0, 0 (= NULL) is returned, and if it is 1, a pointer to the first element is returned as is. (By the way, shouldn't this 0 be written as NULL?)

The main body is in else.

<Omission>
  :
	for (i=1; i<size; i++) {
	    if (debug) printf("table[%"PRIdSIZE"]: position: %d, line: %d, pos: %"PRIdSIZE"\n",
			      i, table[i].position, table[i].line_no, pos);

	    if (table[i].position == pos) {
		return &table[i];
	    }
	    if (table[i].position > pos) {
		return &table[i-1];
	    }
	}
    }
    return &table[i-1];
}   // end of get_line_info()

It scans table from the front. If it finds an entry whose position equals the pos given as an argument, it returns that entry; once it reaches an entry whose position exceeds pos, it returns the entry just before it. In other words, it's a linear search.

Since the scan runs from the front and returns as soon as it passes pos, position must increase monotonically. So an iseq_line_info_entry apparently means that the instructions of iseq from position onward belong to line line_no. It also follows that table is an array sorted by position.

Hmm...? Is this code mostly used for debugging? If so, using a simple algorithm is the right call.

Also, if nothing is found by the end of the table, the element before i is returned outside the else clause. At that point i should equal size, which means the last element is returned.

Now we know how the line being executed is obtained.
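The lookup walked through above can be condensed into a self-contained sketch. This is a simplified restatement for illustration, not the CRuby source: toy_line_entry and toy_get_line_info are hypothetical names, the table is passed in directly instead of being read from iseq, and the debug output is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy of the (position, line_no) pair. */
struct toy_line_entry {
    unsigned int position;  /* instruction offset where a line starts */
    unsigned int line_no;   /* source line for offsets from position on */
};

/* Linear search: returns the entry covering pos, or NULL for an empty
 * table. The table is assumed to be sorted by position. */
static const struct toy_line_entry *
toy_get_line_info(const struct toy_line_entry *table, size_t size, size_t pos)
{
    size_t i;

    if (size == 0) {
        return NULL;
    }
    assert(table != NULL);

    for (i = 1; i < size; i++) {
        if (table[i].position == pos) {
            return &table[i];       /* exact match */
        }
        if (table[i].position > pos) {
            return &table[i - 1];   /* passed pos: previous entry covers it */
        }
    }
    return &table[size - 1];        /* pos is at or beyond the last entry */
}
```

With a table of {0, 1}, {4, 2}, {9, 5}, positions 4 through 8 all map to line 2, and any position from 9 onward maps to line 5. The size == 1 case needs no special branch in this sketch because the loop body never runs and the last (only) element is returned.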

Summary

Here is how rb_bug(), which is used to generate error messages, gets the file and line being executed.

- Source file
  - It is reachable from the VM's execution context information (rb_control_frame_t), through iseq->location.path, so it is read from there.
- Line number
  - The VM's instruction sequence information contains a table recording which line of Ruby code each VM instruction belongs to.
  - The execution context information gives the instruction pointer (pc) being executed, so this table is searched with it to get the line being executed.

By the way

The table appeared to be sorted, so why not use binary search or something similar? While wondering about that, I found the following comment left in the source.

/* TODO: search algorithm is brute force.
         this should be binary search or so. */

According to the comment, it should be changed to something more efficient. Given where it is used, I feel it might be fine to keep it as simple as it is now. Still, since I have come this far, I think I'll try sending a Pull Request later.
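For reference, a binary-search version of the same lookup could look roughly like the sketch below. This is only my illustration of what the TODO comment asks for, not an actual CRuby patch; toy_line_entry and toy_get_line_info_bsearch are hypothetical names, and the table is assumed to be sorted by position with no duplicate positions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy of the (position, line_no) pair. */
struct toy_line_entry {
    unsigned int position;
    unsigned int line_no;
};

/* Binary search: returns the last entry whose position is <= pos,
 * falling back to the first entry when pos precedes every position.
 * Returns NULL for an empty table. */
static const struct toy_line_entry *
toy_get_line_info_bsearch(const struct toy_line_entry *table, size_t size,
                          size_t pos)
{
    size_t lo = 0, hi = size;   /* search window is [lo, hi) */

    if (size == 0) {
        return NULL;
    }
    assert(table != NULL);

    /* Narrow down to the first index whose position is greater than pos. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].position <= pos) {
            lo = mid + 1;
        }
        else {
            hi = mid;
        }
    }
    /* lo now counts the entries with position <= pos. */
    return &table[lo == 0 ? 0 : lo - 1];
}
```

The lookup cost drops from O(n) to O(log n) in the table size, at the price of slightly less obvious code, which is the trade-off discussed above.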

Other

This time I felt there was a lot of guesswork. Even so, I think I picked up a fair amount of knowledge along the way. By reading slowly, I will gradually build up my knowledge and widen what I understand.