Written by Luka Kerr on May 3, 2019

Introduction & Overview

What Is An Operating System?

Kernel

Structure

Processes & Threads

Processes

A process can be in 1 of 3 states: running, ready, or blocked.

A process's information is stored in a process control block (PCB). Each entry in the PCB may contain items such as:

A process consists of at least 3 segments: text (code), data and stack.

Threads

The thread model is a model that separates execution from the environment. Items stored per thread include the program counter, registers, stack and state.

Each thread shares items with the other threads in the process it is running in, including the address space, global variables and open files.

Threads are cheaper to create and destroy than processes, and can take advantage of the parallelism available on machines with more than one CPU.

Context Switch

A context switch refers to either a switch between threads or a switch between processes.

A context switch can occur any time the operating system is invoked.

Concurrency & Synchronisation

Concurrency Example

The following example demonstrates a race condition. The global variable count is shared between threads, but the code accessing it is not executed mutually exclusively, so depending on the interleaving of the two threads, count can end up with a different value on different runs of the program.

Thread 1

void increment() {
  int i;
  i = count;
  i = i + 1;
  count = i;
}

Thread 2

void decrement() {
  int i;
  i = count;
  i = i - 1;
  count = i;
}

Critical Region

A critical region is a region of code where shared resources are accessed. In the example above, the critical region is from the line i = count to the line count = i.

We can control access to the shared resource by controlling access to the code that accesses the resource. As shown above, uncoordinated entry to the critical region results in a race condition.

To coordinate access to critical regions we can use a number of solutions:

Semaphores

A process that waits on a semaphore blocks if the resource is not available. While waiting, processes are put into a queue to avoid busy waiting.

When a resource becomes available, a process signals the semaphore, and a process waiting on that resource is resumed.

Condition Variables

A condition variable is basically a container of threads that are waiting for a certain condition. They allow a process to wait within a monitor until some condition is met, before regaining exclusive access and resuming their task.

Deadlock

What Is Deadlock?

Formal definition: A set of processes is deadlocked if each process in the set is waiting for an event that only another process in the set can cause.

There are 4 conditions for deadlock to occur:

  1. Mutual exclusion condition, where each resource is assigned to 1 process or is available
  2. Hold and wait condition, where processes holding resources can request additional resources
  3. No preemption condition, where previously granted resources cannot be forcibly taken away
  4. Circular wait condition, where there must be a chain of two or more processes, each waiting for a resource held by the next member of the chain

Example

Suppose a process holds a resource A, and requests a resource B. At the same time, a different process holds the resource B and requests A. Both processes cannot proceed and are deadlocked.

Mitigating Deadlock

Strategies for dealing with deadlock include:

Ignoring The Problem

This is a reasonable solution if deadlock is highly unlikely to occur. It is a trade-off between convenience and correctness.

Prevention

By preventing one of the conditions for deadlock to occur, deadlock can be avoided.

A solution is to prevent the circular wait condition. This is done by ordering resources.

Detection & Recovery

If we can determine a system is deadlocked, we can try to recover and restore progress to the system.

Let $A$ be the vector of currently available resources of each type. Using two data structures, $C$ (the current allocation matrix, where row $i$ holds the resources held by process $P_i$) and $R$ (the request matrix, where row $i$ holds the resources $P_i$ is still requesting), resources can be kept track of, and deadlock can be detected.

The detection algorithm is as follows:

  1. Look for an unmarked process $P_i$ for which the $i$-th row of $R$ is less than or equal to $A$
  2. If found, add the $i$-th row of $C$ to $A$ and mark $P_i$. Go to step 1
  3. If no such process exists, terminate

At this point, any remaining processes are deadlocked.

To recover from deadlock various actions can be taken:

Dynamic Avoidance

Introduce a concept of ‘safe’ and ‘unsafe’ states. If the system is in a safe state, then it is not deadlocked, and there exists a scheduling order that results in every process running to completion.

To avoid deadlock, we can only grant requests that result in safe states.

Livelock

Similar to deadlock, but processes aren’t blocked. They can change state regularly, but don’t make progress.

Starvation

A process never receives the resource it is waiting for: despite the resource (repeatedly) becoming free, it is always allocated to another waiting process.

System Calls

System calls can be viewed as special function calls that provide for a controlled entry into the kernel. While in kernel mode, some operation is performed and then execution returns to the caller with a result.

Some examples include:

MIPS

In MIPS system calls are invoked via a syscall instruction. The syscall instruction causes an exception and transfers control to the general exception handler.

OS/161

In OS/161, syscall arguments are passed and returned via the normal C function calling conventions.

Additionally:

Computer Hardware, Memory Hierarchy, Caching

Memory Hierarchy

Typical Access Time   Memory Type     Typical Capacity
1 nsec                Registers       < 1 KB
2 nsec                Cache           1 MB
10 nsec               Main Memory     64 - 512 MB
10 msec               Magnetic Disk   5 - 50 GB
100 sec               Magnetic Tape   20 - 100 GB

Going down the hierarchy:

Caching

Caching allows for faster access to slower storage by using intermediate-speed storage as a cache.

CPU Cache

The CPU cache is located between the CPU and main memory.

Properties:

Effective Access Time

\[T_{eff} = H \times T_1 + (1 - H) \times T_2\]

where

For example:

\[T_{eff} = 0.95 \times 1 + (1 - 0.95) \times (1 + 10) = 1.5\]

We add $1$ to $T_2$ since we need to account for a lookup and miss in memory 1.

Disk Access

Disk access is very slow relative to main memory, and is dominated by the time taken to move the head over data.

To partially fix this, the operating system keeps a subset of the disk’s data in main memory.

This principle is used for other slow access times, such as accessing media via the internet.

File Management

File Structure Abstractions

Three kinds of files:

File Directories

Access Rights

UNIX Access Permissions

Sample Output Of ls -la:

total 16
drwxr-xr-x@  6 Luka  staff   192 29 Apr 13:43 .
drwxr-xr-x@  7 Luka  staff   224 24 Nov 12:42 ..
-rw-r--r--@  1 Luka  staff  6148 29 Apr 13:43 .DS_Store
drwxr-xr-x@ 15 Luka  staff   480 29 Apr 13:43 COMP2111
drwxr-xr-x@  7 Luka  staff   224 27 Apr 17:38 COMP3131
drwxr-xr-x@ 10 Luka  staff   320  1 May 11:50 COMP3231

There are 3 groups in a file’s permissions: user, group, other.

UNIX Storage Stack

Item                   Description
Application            Syscall interface (open(), close()…)
File Descriptor Table  Keeps track of files opened by each process individually
Open File Table        Keeps track of files opened by user-level processes
VFS                    Unified interface to multiple file systems
FS                     Handles physical location of data on disk, exposes files/directories
Buffer Cache           Keeps recently accessed disk blocks in memory
Disk Scheduler         Schedules disk accesses from multiple processes
Device Driver          Handles device-specific protocols
Disk Controller        Handles disk geometry, bad sectors
HDD/SSD                Primary storage device

Contiguous Allocation

A file occupies a contiguous set of blocks on the disk.

Linked List Allocation

Each block contains a pointer to the next block in the chain. Free blocks are also linked in chain.

Pros:

Cons:

File Allocation Table

Keep a map of the entire file system in a separate table. The table is stored on the disk and is replicated in memory.

Pros:

Cons:

I-node Based File Structure

Have a separate table (index-node or i-node) for each file, only keeping tables for open files in memory.

Fragmentation

External Fragmentation

Internal Fragmentation

Example

Take a simplified version of an operating system with multiple processes running:

| P1 |P2|x| . P3 . |xx|  P4 .|x| P5 |

The spaces marked with an x are free memory locations in the operating system, and the spaces marked with a . are free memory inside individual processes.

The spaces marked with an x are examples of external fragmentation, since a new process won’t be big enough to fit in any of the spaces individually, but overall there is enough memory to fit the process.

The spaces marked with a . are examples of internal fragmentation, since memory is left unused inside individual processes, but this memory can’t be used by other processes.

Virtual File System

struct fs {
  int           (*fs_sync)(struct fs *);       /* flush content to disk */
  const char   *(*fs_getvolname)(struct fs *); /* get volume name */
  struct vnode *(*fs_getroot)(struct fs *);    /* get the root vnode */
  int           (*fs_unmount)(struct fs *);    /* unmount the file system */

  void *fs_data; /* private file system specific data */
};

Some VFS functions include:

int vfs_open(char *path, int openflags, mode_t mode, struct vnode **ret);
void vfs_close(struct vnode *vn);
int vfs_readlink(char *path, struct uio *data);
int vfs_symlink(const char *contents, char *path);
int vfs_mkdir(char *path);
int vfs_link(char *oldpath, char *newpath);
int vfs_remove(char *path);
int vfs_rmdir(char *path);
int vfs_rename(char *oldpath, char *newpath);
int vfs_chdir(char *path);
int vfs_getcwd(struct uio *buf);

Vnode

struct vnode {
  int                    vn_refcount;  /* reference count for this vnode */
  struct spinlock        vn_countlock; /* lock for mutex */
  struct fs              *vn_fs;       /* file system containing this vnode */
  void                   *vn_data;     /* file system specific vnode data */
  const struct vnode_ops *vn_ops;      /* pointers to functions operating on vnodes */
};

Some Vnode operations include:

struct vnode_ops {
  int (*vop_eachopen)(struct vnode *object, int flags_from_open);
  int (*vop_reclaim)(struct vnode *vnode);
  int (*vop_read)(struct vnode *file, struct uio *uio);
  int (*vop_readlink)(struct vnode *link, struct uio *uio);
  int (*vop_getdirentry)(struct vnode *dir, struct uio *uio);
  int (*vop_write)(struct vnode *file, struct uio *uio);
  int (*vop_ioctl)(struct vnode *object, int op, userptr_t data);
  int (*vop_stat)(struct vnode *object, struct stat *statbuf);
  int (*vop_gettype)(struct vnode *object, int *result);
  int (*vop_isseekable)(struct vnode *object, off_t pos);
  int (*vop_fsync)(struct vnode *object);
  int (*vop_truncate)(struct vnode *file, off_t len);
  int (*vop_namefile)(struct vnode *file, struct uio *uio);
  int (*vop_creat)(struct vnode *dir, const char *name, int excl, struct vnode **result);
  int (*vop_symlink)(struct vnode *dir, const char *contents, const char *name);
  int (*vop_mkdir)(struct vnode *parentdir, const char *name);
  int (*vop_link)(struct vnode *dir, const char *name, struct vnode *file);
  int (*vop_remove)(struct vnode *dir, const char *name);
  int (*vop_rmdir)(struct vnode *dir, const char *name);
  int (*vop_rename)(struct vnode *vn1, const char *name1, struct vnode *vn2, const char *name2);
  int (*vop_lookup)(struct vnode *dir, char *pathname, struct vnode **result);
  int (*vop_lookparent)(struct vnode *dir, char *pathname, struct vnode **result, char *buf, size_t len);
};

File Descriptor Table & Open File Table

A file (read Vnode) can be represented as a file struct:

struct file {
  int           flags;           /* what permissions this file is opened with */
  off_t         offset;          /* offset into this file */
  struct vnode *vn;              /* pointer to underlying vnode */
  struct lock  *flock;           /* lock for mutex to file */
  unsigned int  reference_count; /* how many processes are referencing this file */
};

The operating system contains a global open file table for all processes:

struct file *file_table[__OPEN_MAX];

Each process contains a file descriptor table containing pointers to entries in the open file table:

struct file **file_descriptor_table[__OPEN_MAX];

Buffer Cache

Used as temporary storage when transferring data between two entities.

Buffering Disk Blocks

Cache

Fast storage used to temporarily hold data to speed up repeated access to the data.

Caching Disk Blocks

File System Consistency

File data is expected to survive a crash. If a pure LRU policy were used to evict blocks from the buffer cache, frequently used critical blocks could stay in memory indefinitely and never be written back to the disk.

Generally, cached disk blocks are prioritised in terms of how critical they are to file system consistency. In UNIX, flushd flushes all modified blocks to the disk every 30 seconds.

An alternative is to use a write through cache, which writes modified blocks immediately to the disk. This has worse performance and generates much more disk traffic. It is still used for some devices, for example in USB drives.

Disk Scheduler

Access time for a disk is calculated as follows:

\[T_a = T_s + \dfrac{1}{2r} + \dfrac{b}{rN}\]

where $T_s$ is the seek time, $r$ is the rotation rate (revolutions per second), $b$ is the number of bytes to transfer, and $N$ is the number of bytes per track. The three terms are the seek time, the average rotational delay, and the transfer time.

Disk Arm Scheduling Algorithms

First Come First Serve (FCFS)

Shortest Seek Time First (SSTF)

Elevator Algorithm (SCAN)

Modified Elevator (C-SCAN)

Ext2

Ext2 Inodes

Indirection

-------------------
|                 |    --------    --------
| single indirect | -> |      | -> | data |
|                 |    --------    --------
|-----------------|
|                 |    --------    --------    --------
| double indirect | -> |      | -> |      | -> | data |
|                 |    --------    --------    --------
|-----------------|
|                 |    --------    --------    --------    --------
| triple indirect | -> |      | -> |      | -> |      | -> | data |
|                 |    --------    --------    --------    --------
-------------------

Assuming 4 byte block numbers and 1K blocks, the max file size is calculated as $12 + 256 + 65536 + 16777216 = 16843020$ blocks $\approx$ 16GB.

Calculated from:

Single Indirect Block

A block number of a block containing block numbers.

Requires two disk accesses to read: one for the indirect block and one for the target block.

Double Indirect Block

Block number of a block containing block numbers of blocks containing block numbers.

Triple Indirect Block

Block number of a block containing block numbers of blocks containing block numbers of blocks containing block numbers.

Access Patterns

File System Reliability

Ext3

Journaling

Journaling Block Device

Interface:

  1. Start a new transaction
  2. Update a disk block as part of a transaction
  3. Complete a transaction
  4. Commit a transaction by writing its data to the journal (persistent storage)
  5. Checkpoint a transaction by flushing the journal to the disk

Modes

Ext3 supports two journaling modes:

Memory Management

Fixed Partitioning

Problems

Dynamic Partitioning

Problems

Dynamic Partitioning Allocation Algorithms

Classic Approach

Dynamic Partitioning Placement Algorithm

First fit is generally the better performing algorithm, and easier to implement.

First Fit
Next Fit
Best Fit
Worst Fit

Compaction

Compaction is the shuffling of memory contents to place all free memory together in one large block. It generally requires hardware support.

External fragmentation can be reduced using compaction.

Swapping

A process can be swapped temporarily out of memory to a backing store, and then brought back into memory for continued execution.

Virtual Memory

Two classic variants:

Paging

60K - 64K |  X  | <- Virtual Page
56K - 60K |  X  |
52K - 56K |  3  | \
...                -> |-----| 28K - 32K  <- Page frame
4K - 8K   |  X  |     |-----| ...
0K - 4K   |  5  | --> |-----| 0K - 4K

Segmentation

Page Faults

Referencing an invalid page triggers a page fault.

There are two main types:

Page Table

A page table is structured logically as an array of frame numbers. Each page table entry (PTE) also has other bits:

Two Level Page Table

Contains a root level page table, and a second level page table for each entry in the root level.

        Virtual Address             -------------------------v
-------------------------------     |     -------------------------      
| 10 bits | 10 bits | 12 bits | -----     | Frame Number | Offset |     
-------------------------------           -------------------------  
     |         |              Second Level      ^        |
     |         |             ----------------   |        |          Main Memory
     |         |             |              |   |        |         --------------
     |         ------------> | Frame Number | ---        |         |            |
     |     Root Level     |  |              |            |         |            |
     |     -----------    |  |              |             -------> | Page Frame |
     |     |         |    |  ----------------                      |            |
     |---> | Pointer | ----                                        |            |
           |         |                                             |            |
           -----------                                             --------------

Inverted Page Table

An array of page numbers stored by frame number.

Hashed Page Table

Similar to the inverted page table, but more efficient with its size based on physical memory size and not virtual.

Translation Lookaside Buffer

Properties

On a context switch, the TLB needs to be flushed so that all its entries are invalidated.

MIPS R3000

MIPS R3000 TLB

struct entry {
  /* entry hi */
  unsigned int vpn  : 20; /* virtual page number */
  unsigned int asid : 6;  /* asid */
  /* next 6 bits unused */
  
  /* entry lo */
  unsigned int pfn : 20; /* page frame number */
  unsigned int n   : 1;  /* not cacheable bit */
  unsigned int d   : 1;  /* dirty bit */
  unsigned int v   : 1;  /* valid bit */
  unsigned int g   : 1;  /* global bit */
  /* next 8 bits unused */
};

MIPS R3000 Address Space Layout

Principle Of Locality

An implication of locality is that we can reasonably predict what instructions and data a program will use in the near future based on its accesses in the recent past.

Temporal locality

Recently accessed items are likely to be accessed in the near future.

Spatial locality

Items whose addresses are near one another tend to be referenced close together in time.

Working Set

The pages/segments required by an application in a time window ($\Delta$) is called its memory working set.

Thrashing

Occurs when a computer’s virtual memory resources are overused, leading to a constant state of paging and page faults, inhibiting most application-level processing.

To recover from thrashing we can suspend a few processes to reduce the degree of multiprogramming.

Page Replacement

In terms of performance, the policies below rank as:

  1. Optimal
  2. LRU
  3. Clock
  4. FIFO

Optimal Replacement Policy

Remove the page that won’t be used for the longest time.

FIFO Replacement Policy

Remove the oldest page.

Least Recently Used Replacement Policy

Remove the least recently used page.

Clock Replacement Policy

When a page is used, set a reference bit to 1. When scanning for a victim, remove the first page with a reference bit of 0, and set all reference bits to 0.

I/O

Device Drivers

A device driver's job is to translate requests made through the device-independent standard interface (open, close, read, write) into the appropriate sequence of commands (register manipulations) for the particular hardware.

Programmed I/O

Interrupt Driven I/O

Direct Memory Access

Buffering

Firstly, let

No Buffering

Cost: $T + C$

User Level Buffering

Single Buffer

Cost: $\max(T, C) + M$

Double Buffer

Cost: $\max(T, C + M)$

Circular Buffer

Multiprocessors

Bus Based UMA

Cache Consistency

Multiprocessor Operating Systems

Individual Operating Systems

Master Slave Multiprocessors

Symmetric Multiprocessors

MCS Locks

Scheduling

Preemptive And Non-Preemptive Scheduling

Scheduling Algorithms

Interactive System Algorithms

Real Time System Algorithms

Interactive Scheduling

Round Robin Scheduling

Priorities

Single Shared Ready Queue

Real Time Scheduling

Given $m$ periodic tasks, where task $i$ has a worst-case compute time $C_i$ and a period $P_i$, the load can only be handled if $\sum_{i = 1}^{m} \dfrac{C_i}{P_i} \le 1$.

Soft Real Time Systems

Must mostly meet all deadlines. For example:

Hard Real Time Systems

Must always meet all deadlines. For example:

Rate Monotonic Scheduling

Earliest Deadline First Scheduling