Written by Luka Kerr on May 3, 2019
Introduction & Overview
What Is An Operating System?
- The operating system is an abstract machine
- The operating system is a resource manager
- The operating system is the software privileged mode
Kernel
- The portion of the operating system that is running in privileged mode
- Usually stays in main memory
Structure
- Usually monolithic
- Has many interdependencies:
- Scheduling on virtual memory
- Virtual memory (VM) on I/O to disk
- VM on files
- Files on VM
- … and so on
Processes & Threads
Processes
- Also called a task or job
- Execution of an individual program
- Encompasses one or more threads
A process can be in 3 states:
- Running
- Blocked
- Ready
A processes’ information is stored in a process control block (PCB). Each entry in the PCB may contain items such as:
- Registers
- Process state
- Process ID
- Pointers to different memory segments (code, data, stack)
- Priority
A process consists of at least 3 segments:
- Code
- Data
- Stack
Threads
- Unit of execution
- Can be traced
- Belongs to a process
The thread model is a model that seperates execution from the environment. Items stored per thread include:
- Program counter
- Registers
- Stack
- State
- Local variables
Each thread shares items across the process they are running in, including:
- Global variables
- Open files
- Address space
Threads are cheaper to create and destroy than processes, and can take advantage of the parallelism available on machines with more than one CPU.
Context Switch
A context switch refers to either a switch between threads or a switch between processes.
A context switch can occur anytime the operating system is invoked
- On a system call
- On an exception
- On an interrupt
Concurrency & Synchronisation
Concurrency Example
The following example demonstrates a race condition. Since the global variable count
is
shared between threads, but the function increment()
isn’t run mutually exclusively,
depending on the execution order, count
could result in a different value after different
runs of the program.
Thread 1
void increment() {
int i;
i = count;
i = i + 1;
count = i;
}
Thread 2
void increment() {
int i;
i = count;
i = i - 1;
count = i;
}
Critical Region
A critical region is a region of code where shared resources are accessed. In the example
above, the critical region is from the line i = count
to the line count = i
.
We can control access to the shared resource by controlling access to the code that accesses the resource. As shown above, uncoordinated entry to the critical region results in a race condition.
To coordinate access to critical regions we can use a number of solutions:
- Locks
- Acquire lock before critical region, release after critical region
- Disabling Interrupts
- Before entering a critical region, disable interrupts
- After leaving the critical region, enable interrupts
- Hardware Support
- Test and set instruction
- If
lock == 0
, set and acquire the lock - If
lock == 1
, another process/thread has the lock
- Semaphores
P()
: test/wait/downV()
: increment/signal/up
- Monitors
- A set of procedures, variables, data types are grouped in a special kind of module, a monitor
- Only one process/thread can be in the monitor at any one time
Semaphores
Semaphores wait for a resource if it’s not available. During waiting, processes are put into a queue to avoid busy waiting.
When a resource is available, a process will signal, and any process waiting on that resource will be resumed.
Condition Variables
A condition variable is basically a container of threads that are waiting for a certain condition. They allow a process to wait within a monitor until some condition is met, before regaining exclusive access and resuming their task.
Deadlock
What Is Deadlock?
Formal definition: A set of processes is deadlocked if each process in the set is waiting for an event that only another process in the set can cause.
There are 4 conditions for deadlock to occur:
- Mutual exclusion condition, where each resource is assigned to 1 process or is available
- Hold and wait condition, where processes holding resources can request additional resources
- No preemption condition, where previously granted resources cannot be forcibly taken away
- Circular wait condition, where there must be a chain of > 1 processes, each waiting for a resource held by the next member of the chain
Example
Suppose a process holds a resource A, and requests a resource B. At the same time, a different process holds the resource B and requests A. Both processes cannot proceed and are deadlocked.
Mitigating Deadlock
Strategies for dealing with deadlock include:
- Ignoring the problem altogether
- Prevention
- Detection and recovery
- Dynamic avoidance
Ignoring The Problem
This is a reasonable solution if deadlock is highly unlikely to occur. It is a trade off between convenience and correctness.
Prevention
By preventing one of the conditions for deadlock to occur, deadlock can be avoided.
- Preventing mutual exclusion usually isn’t feasible
- Preventing the hold and wait condition requires processes to request all resources needed on startup, which is not always possible
- Preventing the no preemption condition isn’t feasible, since forcible taking resources from a process is not a viable option
A solution is to prevent the circular wait condition. This is done by ordering resources.
Detection & Recovery
If we can determine a system is deadlocked, we can try to recover and restore progress to the system.
Let:
- The resources in existence be $E = (e_1, e_2, \dots , e_m)$
- the resources available be $A = (a_1, a_2, \dots , a_m)$
Using two data structures:
- A current allocation matrix ($C$) where row $n$ is the current allocation to process $n$ \(\begin{bmatrix} c_{11} & c_{12} & c_{13} & \cdots & c_{1m} \\ c_{21} & c_{22} & c_{23} & \cdots & c_{2m} \\ \vdots & \vdots & \vdots & & \vdots \\ c_{n1} & c_{n2} & c_{n3} & \cdots & c_{nm} \end{bmatrix}\)
- A request matrix ($R$) where row $n$ is what process $n$ needs \(\begin{bmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1m} \\ r_{21} & r_{22} & r_{23} & \cdots & r_{2m} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{n1} & r_{n2} & r_{n3} & \cdots & r_{nm} \end{bmatrix}\)
resources can be kept track of, and deadlock can be detected.
The detection algorithm is as follows:
- Look for an unmarked process $P_i$ for which the $i$-th row of $R$ is less than or equal to $A$
- If found, add the $i$-th row of $C$ to $A$ and mark $P_i$. Go to step 1
- If no such process exists, terminate
At this point, any remaining processes are deadlocked.
To recover from deadlock various actions can be taken:
- Take a resource from another process
- Rollback to a previous state (no guarantee deadlock won’t occur again)
- Kill processes
Dynamic Avoidance
Introduce a concept of ‘safe’ and ‘unsafe’ states. If the system is in a safe state, then it is not deadlocked, and there exists a scheduling order that results in every process running to completion.
To avoid deadlock, we can only grant requests that result in safe states.
Livelock
Similar to deadlock, but processes aren’t blocked. They can change state regularly, but don’t make progress.
Starvation
A process never receives the resource it is waiting for, despite the resource (repeatedly) becoming free, the resource is always allocated to another waiting process.
System Calls
System calls can be viewed as special function calls that provide for a controlled entry into the kernel. While in kernel mode, some operation is performed and then execution returns to the caller with a result.
Some examples include:
- File operations (
read()
,write()
, …) - Process management (
fork()
,exit()
, …) - Directory management (
mkdir()
,link()
, …)
MIPS
In MIPS system calls are invoked via a syscall
instruction. The syscall
instruction
causes an exception and transfers control to the general exception handler.
OS/161
In OS/161 syscall arguments passed and returned via the normal C function calling conventions.
Additionally:
- Register
v0
contains the system call number - On return register
a3
contains:- 0 if success with
v0
containing the result - Non-zero if error with
errno
containing the error, andv0
containing -1
- 0 if success with
Computer Hardware, Memory Heirarchy, Caching
Memory Hierarchy
Typical Access Time | Memory Type | Typical Capacity |
---|---|---|
1 nsec | Registers | < 1 KB |
2nsec | Cache | 1 MB |
10 nsec | Main Memory | 64 - 512 MB |
10 msec | Magnetic Disk | 5 - 50 GB |
100 sec | Magnetic Tape | 20 - 100 GB |
Going down the hierarchy:
- Decreasing cost per bit
- Increasing capacity
- Increasing access time
- Decreasing frequency of access to the memory by the processor
Caching
Caching allows for faster access to slower storage by using intermediate-speed storage as a cache.
CPU Cache
The CPU cache is located between the CPU and main memory.
Properties:
- 1 to a few cycles when accessed
- Is hardware maintained
- Sizes range from a few KB to tens of MB
- Uses a hierarchy of caches (2-5 levels)
Effective Access Time
\[T_{eff} = H \times T_1 + (1 - H) \times T_2\]where
- $T_1$ = access time of memory 1
- $T_2$ = access time of memory 2
- $H$ = hit rate in memory 1
For example:
- Cache memory access time 1ns
- Main memory access time 10ns
- Hit rate of 95%
We add $1$ to $T_2$ since we need to account for a lookup and miss in memory 1.
Disk Access
Disk access is very slow relative to main memory, and is dominated by the time taken to move the head over data.
To partially fix this, the operating system keeps a subset of the disk’s data in main memory.
This principle is used for other slow access times, such as accessing media via the internet.
File Management
File Structure Abstractions
Three kinds of files:
- Byte sequence
- Unstructured, simple management for OS
- Record sequence
- Collection of bytes treated as a unit
- Key-based, tree structured
- Records of variable length, each has an associated key
File Directories
- Provide mapping between file names and the files themselves
- Contain information about files (attributes, location, ownership)
- Directory itself is a file owned by the operating system
Access Rights
- None
- Can’t read, see or write to file
- Knowledge
- Can see the file and who the owner is
- Execution
- Can load and execute a program but can’t copy or write to it
- Reading
- Can read from a file
- Appending
- Can add data to the file but cannot modify or delete any of the file’s contents
- Updating
- Can modify, delete, and add to the file’s data
- Changing protection
- Can change access rights granted to other users
- Deletion
- Can delete a file
- Owners
- Has all rights previously listed
- May grant rights to others
UNIX Access Permissions
Sample Output Of ls -la
:
total 16
drwxr-xr-x@ 6 Luka staff 192 29 Apr 13:43 .
drwxr-xr-x@ 7 Luka staff 224 24 Nov 12:42 ..
-rw-r--r--@ 1 Luka staff 6148 29 Apr 13:43 .DS_Store
drwxr-xr-x@ 15 Luka staff 480 29 Apr 13:43 COMP2111
drwxr-xr-x@ 7 Luka staff 224 27 Apr 17:38 COMP3131
drwxr-xr-x@ 10 Luka staff 320 1 May 11:50 COMP3231
There are 3 groups in a file’s permissions: user, group, other.
d
for directoriesr
for read permissionsw
for write permissionsx
for execute permissions
UNIX Storage Stack
Item | Description |
---|---|
Application | Syscall interface (open() , close() …) |
File Descriptor Table | Keeps track of files opened by each process individually |
Open File Table | Keeps track of files opened by user level processes |
VFS | Unified interface to multiple file systems |
FS | Handles physical location of data on disk, exposes files/directories |
Buffer Cache | Keep recently accessed disk blocks in memory |
Disk Scheduler | Schedule disk accesses from multiple processes |
Device Driver | Handles device specific protocols |
Disk Controller | Handles disk geometry, bad sectors |
HDD/SSD | Primary storage device |
Contiguous Allocation
A file occupies a contiguous set of blocks on the disk.
Linked List Allocation
Each block contains a pointer to the next block in the chain. Free blocks are also linked in chain.
Pros:
- Only single metadata entry per file
- Best for sequential files
Cons:
- Poor for random access
- Blocks end up scattered across the disk due to free list eventually being randomised
File Allocation Table
Keep a map of the entire file system in a separate table. The table is stored on the disk and is replicated in memory.
Pros:
- Random access is fast
Cons:
- Requires a lot of memory for large disks
- Free block lookup is slow
I-node Based File Structure
Have a separate table (index-node or i-node) for each file, only keeping tables for open files in memory.
Fragmentation
External Fragmentation
- The space wasted external to the allocated memory regions
- Memory space exists to satisfy a request but it is unusable as it is not contiguous
Internal Fragmentation
- The space wasted internal to the allocated memory regions
- Allocated memory may be slightly larger than requested memory; this size difference is wasted memory internal to a partition
Example
Take a simplified version of an operating system with multiple processes running:
| P1 |P2|x| . P3 . |xx| P4 .|x| P5 |
The spaces marked with an x
are free memory locations in the operating system, and the spaces
marked with a .
are free memory inside individual processes.
The spaces marked with an x
are examples of external fragmentation, since a new process
won’t be big enough to fit in any of the spaces individually, but overall there is enough
memory to fit the process.
The spaces marked with a .
are examples of internal fragmentation, since memory is left unused
inside individual processes, but this memory can’t be used by other processes.
Virtual File System
- Provides single system call interface for many file systems (DOS, Ext2, …)
- Transparent handling of network file systems (NFS, AFS, …)
- File-based interface to arbitrary device drivers (
/dev
) and kernel data structures (/proc
) - Provides an indirection layer for system calls
struct vfs {
int (*fs_sync)(struct fs *); /* flush content to disk */
const char *(*fs_getvolname)(struct fs *); /* get volume name */
struct vnode *(*fs_getroot)(struct fs *); /* get the root vnode */
int (*fs_unmount)(struct fs *); /* unmount the file system */
void *fd_data /* private file system specific data */
};
Some VFS functions include:
int vfs_open(char *path, int openflags, mode_t mode, struct vnode **ret);
void vfs_close(struct vnode *vn);
int vfs_readlink(char *path, struct uio *data);
int vfs_symlink(const char *contents, char *path);
int vfs_mkdir(char *path);
int vfs_link(char *oldpath, char *newpath);
int vfs_remove(char *path);
int vfs_rmdir(char *path);
int vfs_rename(char *oldpath, char *newpath);
int vfs_chdir(char *path);
int vfs_getcwd(struct uio *buf);
Vnode
- Represents a file (inode) in the underlying filesystem
- Points to the real inode
- Contains pointers to functions to manipulate files/inodes
struct vnode {
int vn_refcount; /* reference count for this vnode */
struct spinlock vn_countlock; /* lock for mutex */
struct fs *vn_fs; /* file system containing this vnode */
void *vn_data; /* file system specific vnode data */
const struct vnode_ops *vn_ops; /* pointers to functions operating on vnodes */
}
Some Vnode operations include:
struct vnode_ops {
int (*vop_eachopen)(struct vnode *object, int flags_from_open);
int (*vop_reclaim)(struct vnode *vnode);
int (*vop_read)(struct vnode *file, struct uio *uio);
int (*vop_readlink)(struct vnode *link, struct uio *uio);
int (*vop_getdirentry)(struct vnode *dir, struct uio *uio);
int (*vop_write)(struct vnode *file, struct uio *uio);
int (*vop_ioctl)(struct vnode *object, int op, userptr_t data);
int (*vop_stat)(struct vnode *object, struct stat *statbuf);
int (*vop_gettype)(struct vnode *object, int *result);
int (*vop_isseekable)(struct vnode *object, off_t pos);
int (*vop_fsync)(struct vnode *object);
int (*vop_truncate)(struct vnode *file, off_t len);
int (*vop_namefile)(struct vnode *file, struct uio *uio);
int (*vop_creat)(struct vnode *dir, const char *name, int excl, struct vnode **result);
int (*vop_symlink)(struct vnode *dir, const char *contents, const char *name);
int (*vop_mkdir)(struct vnode *parentdir, const char *name);
int (*vop_link)(struct vnode *dir, const char *name, struct vnode *file);
int (*vop_remove)(struct vnode *dir, const char *name);
int (*vop_rmdir)(struct vnode *dir, const char *name);
int (*vop_rename)(struct vnode *vn1, const char *name1, struct vnode *vn2, const char *name2);
int (*vop_lookup)(struct vnode *dir, char *pathname, struct vnode **result);
int (*vop_lookparent)(struct vnode *dir, char *pathname, struct vnode **result, char *buf, size_t len);
}
File Descriptor Table & Open File Table
A file (read Vnode) can be represented as a file
struct:
struct file {
int flags; /* what permissions this file is opened with */
off_t offset; /* offset into this file */
struct vnode *vn; /* pointer to underlying vnode */
struct lock *flock; /* lock for mutex to file */
unsigned int reference_count; /* how many processes are referencing this file */
}
The operating system contains a global open file table for all processes:
struct file *file_table[__OPEN_MAX];
Each process contains a file descriptor table containing pointers to file
’s in the open
file table:
struct file **file_descriptor_table[__OPEN_MAX];
Buffer Cache
Used as temporary storage when transferring data between two entities.
Buffering Disk Blocks
- Allow applications to work with arbitrarily sized region of a file
- Writes can return immediately after copying to kernel buffer
- Can implement read-ahead by pre-loading next block on disk into kernel buffer
Cache
Fast storage used to temporarily hold data to speed up repeated access to the data.
Caching Disk Blocks
- On access
- Before loading block from disk, check if it is in cache first
- Avoids disk access
- Can optimise for repeated access for single or several processes
File System Consistency
File data is expected to survive. If a LRU policy was implemented for evicting the buffer cache, critical data may be in memory forever if it is frequently used.
Generally, cached disk blocks are prioritised in terms of how critical they are to file system
consistency. In UNIX, flushd
flushes all modified blocks to the disk every 30 seconds.
An alternative is to use a write through cache, which writes modified blocks immediately to the disk. This has worse performance and generates much more disk traffic. It is still used for some devices, for example in USB drives.
Disk Scheduler
Access time for a disk is calculated as follows:
\[T_a = T_s + \dfrac{1}{2r} + \dfrac{b}{rN}\]where:
- $T_s$ is seek time (ms or s)
- $r$ is rotational speed (rpm)
- $b$ is bytes to transfer
- $N$ is the number of bytes per track on the disk
Disk Arm Scheduling Algorithms
First Come First Serve (FCFS)
- Process requests as they come
- Fair (no starvation)
- Good for a few processes with clustered requests
- Deteriorates to random if there are many processes
Shortest Seek Time First (SSTF)
- Select request that minimises the seek time
- Generally performs much better than FCFS
- May lead to starvation
Elevator Algorithm (SCAN)
- Moves head in one direction
- Better than FCFS, usually worse than SSTF
- Avoids starvation
- Makes poor use of sequential reads (on down-scan)
Modified Elevator (C-SCAN)
- Like elevator, but reads sectors in only one direction
- Better locality on sequential reads
- Better use of read ahead cache on controller
- Reduces max delay to read a particular sector
Ext2
- Second Extended Filesystem
- Features:
- Block size (1024, 2048, and 4096) configured at file system creation
- Inode-based FS
Ext2 Inodes
- Mode
- Type
- Access mode
- UID
- User ID
- GID
- Group ID
- atime
- Time of last access
- ctime
- Time when file was created
- mtime
- Time when file was last modified
- Size
- Offset of the highest byte written
- Block count
- Number of disk blocks used by the file
- Direct blocks
- Block numbers of first 12 blocks in the file
Indirection
-------------------
| | -------- --------
| single indirect | -> | | -> | data |
| | -------- --------
------------------|
| | -------- -------- --------
| double indirect | -> | | -> | | -> | data |
| | -------- -------- --------
------------------|
| | -------- -------- -------- --------
| triple indirect | -> | | -> | | -> | | -> | data |
| | -------- -------- -------- --------
-------------------
Assuming 4 byte block numbers and 1K blocks, the max file size is calculated as $12 + 256 + 65536 + 16777216 = 16843020$ blocks $\approx$ 16GB.
Calculated from:
- Number of blocks addressable from one block = $(1K \ \text{or} \ 1024)/4 = 256$
- Direct blocks = $12$
- Single indirect blocks = $256$
- Double indirect blocks = $256^2$
- Triple indirect blocks = $256^3$
Single Indirect Block
A block number of a block containing block numbers.
Requires two disk access to read, one for the indirect block and one for the target block.
Double Indirect Block
Block number of a block containing block numbers of blocks containing block numbers.
Triple Indirect Block
Block number of a block containing block numbers of blocks containing block numbers of blocks containing block numbers.
Access Patterns
- To read 1 byte
- Best case: 1 read via a direct block
- Worst case: 4 reads via the triple indirect block
- To write 1 byte
- Best case: 1 write via direct block
- Worst case (if all blocks allocated): 4 reads + 1 write
- Worst case (if no blocks allocated):
- 4 writes (write all blocks)
- 1 read, 4 writes (read/write 1 indirect, write 2 indirect, write 1 data)
- 2 reads, 3 writes (read 1 indirect, read/write 1 indirect, write 1 indirect; write 1 data)
- 3 reads, 2 writes (read 2 indirect, read/write 1 indirect; write 1 data)
File System Reliability
- Disk writes are buffered in RAM, on crash or power outage data can be lost
- File system operations are non-atomic
e2fsck
- Scans the disk after an unclean shutdown and attempts to restore file system invariants
Ext3
- Third Extended Filesystem
- Features:
- Journaling
- Backward and forward compatibility with Ext2
Journaling
Journaling Block Device
Interface:
- Start a new transaction
- Update a disk block as part of a transaction
- Complete a transaction
- Commit a transaction by writing it’s data to the journal (persistent storage)
- Checkpoint a transaction by flushing the journal to the disk
Modes
Ext3 supports two journaling modes:
- Metadata+data
- Enforces atomicity of all file system operations
- Metadata journaling
- Metadata is journalled
- Data blocks are written directly to the disk
- Improves performance and enforces file system integrity
Memory Management
Fixed Partitioning
- Divides memory into fixed equal-sized partitions
- Any process $\le$ partition size can be loaded into a partition
- Partitions are free or busy
Problems
- Any unused space in the partition is wasted (internal fragmentation)
- Processes smaller than main memory, but larger than a partition cannot run
Dynamic Partitioning
- Partitions are of variable length
- Process is allocated exactly what it needs
Problems
- External fragmentation
Dynamic Partitioning Allocation Algorithms
Classic Approach
-
Represent available memory as a linked list of available “holes”
struct hole { unsigned int address; /* base address */ unsigned int size; /* size of memory hole */ struct hole *link; /* link to next memory hole */ }
-
Kept in order of increasing address
Dynamic Partitioning Placement Algorithm
First fit is generally the better performing algorithm, and easier to implement.
First Fit
- Scan the list for the first entry that fits
- If greater in size, break it into an allocated and free part
Next Fit
- Like first fit, but begins search from where the last request succeeded
- Performs worse than first fit as it breaks up the large free space at end of memory
Best Fit
- Chooses the block that is closest in size to the request
- Performs worse than first fit due to having to search complete list
Worst Fit
- Chooses the block that is largest in size
- Very poor performance
Compaction
Compaction is the shuffling of memory contents to place all free memory together in one large block. It generally requires hardware support.
External fragmentation can be reduced using compaction.
Swapping
A process can be swapped temporarily out of memory to a backing store, and then brought back into memory for continued execution.
Virtual Memory
Two classic variants:
- Paging (most dominant)
- Segmentation
Paging
60K - 64K | X | <- Virtual Page
56K - 60K | X |
52K - 56K | 3 | \
... -> |-----| 28K - 32K <- Page frame
4K - 8K | X | |-----| ...
0K - 4K | 5 | --> |-----| 0K - 4K
- Partition physical memory into small equal sized chunks called frames
- Divide each process’s virtual (logical) address space into same size chunks called pages
- The operating system maintains a page table to translate each virtual address to physical address
- Process’s physical memory does not have to be contiguous
- No external fragmentation
- Small internal fragmentation
- Allows sharing by mapping several pages to the same frame
- Abstracts physical organisation
Segmentation
- Supports user’s view of memory
- A program is a collection of segments (functions, methods, stack, arrays, variables, …)
- Logical address consists of a two tuple: (segment number, offset)
- A segment table has entries containing a base address and limit
- Allows sharing of segments
Page Faults
Referencing an invalid page triggers a page fault.
There are two main types:
- Illegal address
- Page not resident
Page Table
A page table is structured logically as an array of frame numbers. Each page table entry (PTE) also has other bits:
valid
bit- Indicates a valid mapping for the page
dirty
bit- Indicates the page may have been modified in memory
reference
bit- Indicates the page has been accessed
protection
bits- R/W/X or combinations of the three
caching
bit- Indicates whether the processor should bypass the cache when accessing memory
Two Level Page Table
Contains a root level page table, and a second level page table for each entry in the root level.
Virtual Address -------------------------v
------------------------------- | -------------------------
| 10 bits | 10 bits | 12 bits | ----- | Frame Number | Offset |
------------------------------- -------------------------
| | Second Level ^ |
| | ---------------- | | Main Memory
| | | | | | --------------
| ------------> | Frame Number | --- | | |
| Root Level | | | | | |
| ----------- | | | -------> | Page Frame |
| | | | ---------------- | |
|---> | Pointer | ---- | |
| | | |
----------- --------------
Inverted Page Table
An array of page numbers stored by frame number.
- Grows with size of RAM, not virtual address space
- Need a separate data structure for non-resident pages
- Saves a vast amount of space
Hashed Page Table
Similar to the inverted page table, but more efficient with its size based on physical memory size and not virtual.
Translation Lookaside Buffer
- Given a virtual address, the processor examines the TLB
- If matching PTE found (TLB hit), the address is translated
- Otherwise (TLB miss), the page number is used to index into the process’s page table
Properties
- Holds a recently used subset of page table entries
- TLB may or may not be under operating system control
- Typically 64 - 128 entries
- Can have separate TLBs for instruction fetch and data access
- Is a shared piece of hardware
- Entries are process specific
On a context switch, the TLB needs to be flushed so that all its entries are invalidated.
MIPS R3000
MIPS R3000 TLB
struct entry {
/* entry hi */
unsigned int vpn : 20; /* virtual page number */
unsigned int asid : 6; /* asid */
/* next 6 bits unused */
/* entry lo */
unsigned int pfn : 20; /* page frame number */
unsigned int n : 1; /* not cacheable bit */
unsigned int d : 1; /* dirty bit */
unsigned int v : 1; /* valid bit */
unsigned int g : 1; /* global bit */
/* next 8 bits unused */
};
MIPS R3000 Address Space Layout
kseg0
- 512 megabytes
- TLB not used
- Cacheable
- Only kernel-mode accessible
- Usually where the kernel code is placed
kuseg
- 2 gigabytes
- TLB translated
- Cacheable
- User mode and kernel mode accessible
- Page size is 4K
kseg1
- 512 megabytes
- Not cacheable
- Only kernel-mode accessible
- Where devices are accessed
Principle Of Locality
- Programs tend to reuse data and instructions they have used recently
- 90/10 rule: a program spends 90% of its time in 10% of its code
An implication of locality is that we can reasonably predict what instructions and data a program will use in the near future based on its accesses in the recent past
Temporal locality
Recently accessed items are likely to be accessed in the near future.
Spatial locality
Items whose addresses are near one another tend to be referenced close together in time.
Working Set
The pages/segments required by an application in a time window ($\Delta$) is called its memory working set.
Thrashing
Occurs when a computer’s virtual memory resources are overused, leading to a constant state of paging and page faults, inhibiting most application-level processing.
To recover from thrashing we can suspend a few processes to reduce degree of multiprogramming.
Page Replacement
In terms of performance, the policies below rank as:
- Optimal
- LRU
- Clock
- FIFO
Optimal Replacement Policy
Remove the page that won’t be used for the longest time.
- Impossible to implement
- A theoretic reference point
FIFO Replacement Policy
Remove the oldest page.
- Easy to implement
- Age of a page is isn’t necessarily related to usage
Least Recently Used Replacement Policy
Remove the least recently used page.
- Impossible to implement efficiently
- Implementation requires a time stamp to be kept for each page, updated on every reference
Clock Replacement Policy
When a page is used, set a reference bit to 1. When scanning for a victim, remove the first page with a reference bit of 0, and set all reference bits to 0.
I/O
Device Drivers
- Originally compiled into the kernel, nowadays they are dynamically loaded when needed
- Are classified into similar categories - block devices and character (stream of data) devices
- The operating system defines a standard interface to the different classes of devices
A device drivers job is to translate requests through the device-independent standard
interface (open
, close
, read
, write
) into appropriate sequence of commands (register
manipulations) for the particular hardware.
Programmed I/O
- I/O controller performs an action rather than the processor
- Known as polling or busy waiting
- Wastes CPU cycles
Interrupt Driven I/O
- Processor is interrupted when I/O controller is ready to exchange data
- No needless waiting
- Consumes a lot of processor time because every word read or written passes through the processor
Direct Memory Access
- Transfers data directly between memory and device, CPU not needed for copying
- Processor is only involved at the beginning and end of transfer
- Reduces number of interrupts
- Requires contiguous regions (buffers)
Buffering
Firstly, let
- $T$ be the transfer time for a block from device
- $C$ be the computation time to process incoming block
- $M$ be the time to copy kernel buffer to user buffer
No Buffering
- Process must read/write a device a byte/word at a time
- Process must what until each I/O is complete
- Inefficient
Cost: $T + C$
User Level Buffering
- Process specifies a memory buffer that incoming data is placed in until it fills
- Only a single system call, and block/wakeup per data buffer
- Could lose data if buffer is paged out to disk
Single Buffer
- Operating system assigns a buffer in kernel’s memory for an I/O request
- Swapping can occur since input is taking place in system memory
Cost: $max(T, C) + M$
Double Buffer
- Use two system buffers rather than one
- A process can transfer data to or from one buffer while the operating system empties or fills the other buffer
- May be insufficient for really bursty traffic
Cost: $max(T, C + M)$
Circular Buffer
- More than two buffers are used
- Each individual buffer is one unit in a circular buffer
- Used when I/O operation must keep up with process
Multiprocessors
Bus Based UMA
- More than one processor on a single bus connected to memory
- Bus bandwidth becomes a bottleneck with more than just a few CPUs
- Each processor has a cache to reduce its need for memory access
- Each processor has private local memory
- Keep private data off the shared memory bus
- Complicates application development
Cache Consistency
- Usually handled by hardware
- Writes to one cache propagate or invalidate to other caches
Multiprocessor Operating Systems
Individual Operating Systems
- Each CPU has its own operating system
- Simple to implement, avoids concurrency issues
- Consistency is an issue
Master Slave Multiprocessors
- Operating system runs mostly on a single CPU (CPU 1)
- User level apps run on the other CPUs
- All system calls are passed to CPU 1 for processing
- Little synchronisation required
- Single, centralised scheduler
- Memory can be allocated as needed to all CPUs
- Cross CPU traffic is slow compared to local
Symmetric Multiprocessors
- OS kernel run on all processors
- Concurrency issues across processors, solve using one of:
- Have 1 single “big lock” so only one CPU can be in the kernel at a time
- Identify largely independent parts of the kernel and make each of them their own critical section
MCS Locks
- Each CPU enqueues its own private lock variable into a queue and spins on it
- On lock release, the releaser unlocks the next lock in the queue
Scheduling
Preemptive And Non-Preemptive Scheduling
- Non-Preemptive
- Once a thread is in the running state, it continues until it completes, it blocks on I/O or voluntarily yields the CPU
- Preemptive
- Current thread can be interrupted by OS and moved to ready state
- Usually after a timer interrupt and process has exceeded its maximum run time
- Ensures fairer service as single thread can’t monopolise the system
Scheduling Algorithms
- Goals
- Fairness
- Policy enforcement
- Balance/efficiency
Interactive System Algorithms
- Minimise response time
- Provide proportionality
Real Time System Algorithms
- Must meet deadlines
- Provide predictability
Interactive Scheduling
Round Robin Scheduling
- Each process is given a timeslice to run in
- When the timeslice expires, the next process preempts the current process, and runs for its timeslice, and so on
Priorities
- Each process or thread is associated with a priority
- Provides a basic mechanism to influence a scheduler decision
- Usually implemented by multiple priority queues, with round robin on each queue
Single Shared Ready Queue
- When a CPU goes idle, it takes the highest priority process from the shared ready queue
Real Time Scheduling
- Correctness of the system may depend not only on the logical result of the computation but also on the time when these results are produced
- Not necessarily fast, but must be predictable
Given
- $m$ periodic events
- event $i$ occurs within period $P_i$ and requires $C_i$ seconds
then the load can only be handled if $\Sigma_{i = 1}^{m} \dfrac{C_i}{P_i} \le 1$.
Soft Real Time Systems
Must mostly meet all deadlines. For example:
- Washing machine runs slightly shorter or longer every so often
- Frames may get dropped at times in a multi media system
Hard Real Time Systems
Must always meet all deadlines. For example:
- An airbag system has to react in milliseconds
- A jet must adjust immediately to winds
Rate Monotonic Scheduling
- Priorities are assigned based on the period of each task
- The shorter the period, the higher the priority
Earliest Deadline First Scheduling
- The task with the earliest deadline is chosen next